Nasr Ullah and Philip K. Brownfield, Guest Editors 


It’s not every day that three world technology leaders join together to 
define a new standard in computing architecture. But that’s exactly 
what Apple, IBM, and Motorola have done since combining forces in 
the PowerPC alliance. At the Somerset design facility in Austin, Texas, 
engineers from each of the three companies design PowerPC micro- 
processors, employing a formal VLSI design methodology drawn from 
the best of IBM’s and Motorola’s development environments. With 
designs completed on schedule for the first two PowerPC family mem- 
bers — the PowerPC 601 and PowerPC 603 microprocessors — the com- 
puting industry can expect a steady stream of PowerPC microprocessors 
satisfying a broad range of market requirements to be introduced in 
the years to come. 

This month’s issue of Communications covers a wide range of topics 
surrounding the PowerPC technology, beginning with a description by 
Moore and Stanphill of the origins of the Apple-IBM-Motorola alliance 
itself and an overview of the first generation of four distinct micro- 
processors designed at Somerset. 

The PowerPC architects paid particular attention to defining 
PowerPC Architecture details, making it flexible and scalable, thus pro- 
viding plenty of room for growth across a varied family of processors 
suitable for a wide range of designs — from high-performance, multipro- 
cessing systems to handheld computing devices. In addition, the defin- 
ition of the architecture ensures a guaranteed common software 
environment for operating system and applications software developers. 
Diefendorff describes the history of PowerPC Architecture evolution. 

Design goals for the PowerPC 603 microprocessor were to deliver 
high-performance RISC computing with power consumption low 
enough for portable laptop computing. Burgess et al. describe features 
of the PowerPC 603 implementation, while Suessmith and Paap focus
their attention on the details of the power management features of the
PowerPC 603 microprocessor. Determining the impact of implemen- 
tation choices on performance is an important element of modern 
processor design. Poursepanj describes the methodology for perfor- 
mance modeling used by Somerset PowerPC designers. 

The needs of computer systems developers do not end with the 
processor silicon. High-quality compilers are equally important to the 
creation of high-performance systems. Shipnes and Phillip describe the 
modular approach to the organization of Motorola’s optimized 
PowerPC compilers. Increasingly important is the need to begin soft- 
ware development and testing under software simulation before system 
hardware is available, and Anderson provides an overview of Motorola’s 
PowerPC simulator family. 

The alliance companies, together and with their other partners, con- 
tinue to develop the other elements needed in a complete PowerPC 
computer system. A preliminary version of the PowerPC Reference 
Platform Specification has been distributed to computer system
developers. Operating systems such as AIX, Macintosh System 7, Windows
NT, Workplace OS, and others are already available or will become
available for PowerPC systems, as are chipsets and peripheral devices
from IBM, Motorola, and other silicon providers.

A great deal of effort lies behind the creation of this special issue. 
Special thanks go to Myhong Nguyenphu of IBM for her immense
help in organizing and coordinating the IBM contributions to this 
issue. We also wish to thank the members of the IBM and Motorola 
technical community who contributed their time and energy to write 
and review the articles that follow, as well as the ACM article reviewers 
who provided many useful suggestions. 


COMMUNICATIONS OF THE ACM, June 1994/Vol.37, No.6










Charles R. Moore and Russell C. Stanphill







The PowerPC Alliance

In September 1991, IBM, Apple, and Motorola announced a broad alliance to jointly
pursue the development of some of the following emerging technologies. The alliance
included five principal initiatives:


Object-oriented technology. Apple and IBM agreed to form Taligent, an
independent company, to develop and license a future operating system
based on object-oriented design principles.

Multimedia technology. Apple and IBM agreed to form Kaleida, a joint
venture company, to create and license multimedia technologies.

Connectivity and networking. IBM and Apple agreed to develop hardware
and software solutions that allow their systems to interact more
effectively.

Open systems environment. Apple, IBM and Motorola agreed to jointly
define the PowerOpen™ environment. PowerOpen supports both AIX and
Macintosh applications, and will be licensed to other vendors.

Microprocessor technology. IBM and Motorola agreed to jointly develop
a broad family of microprocessors based on a derivative of IBM’s
POWER™ architecture called the PowerPC Architecture™.

Central to this alliance is the commitment to the PowerPC Architecture
and a family of microprocessors. Essentially, the companies have agreed
to a vision that brings the architecture 
to the forefront of the computing 
world, and enables a broad range of 
PowerPC microprocessor-based sys- 
tems. This article will primarily focus 
on the PowerPC microprocessor tech- 
nology aspects of the alliance. 

PowerPC Microprocessors 

In addition to a commitment to the 
PowerPC Architecture, the alliance 
also calls for the development of a 
family of single-chip PowerPC micro- 
processor implementations. This ini- 
tial family of devices is charged with 
providing superior solutions for sys- 
tem designers across a very wide 
range of system design points. As a 
result, the microprocessors must be 
general-purpose and offer versatile 
interfaces. The economics of the com- 
puter industry today demand that 
these designs be relatively inexpen- 
sive and allow for inexpensive 
system-level solutions. In addition, 
although the various members of the 
family may require specific optimiza- 
tion trade-offs for particular market 
segments, they must all exploit the 
inherent advantages of the PowerPC 


Architecture to achieve high perfor- 
mance. 

To develop these devices, IBM and 
Motorola have jointly formed a 
PowerPC design center in Austin 
called Somerset. This jointly man- 
aged center is staffed with approxi- 
mately 300 experienced microproces- 
sor designers. A design methodology 
has been established that allows the 
team to achieve high-quality, aggres- 
sive designs on relatively short devel- 
opment cycles. This is achieved by 
blending tools that enhance designer 
productivity with tools that allow for 
full custom design and analysis. The 
tool set allows the designs to be fabri- 
cated by either IBM or Motorola 
manufacturing facilities using a com- 
mon half-micron CMOS process. 

PowerPC Architecture 

The PowerPC Architecture is a third- 
generation RISC architecture that 
has been optimized for the diverse 
computing requirements of the fu- 
ture [3]. It has been jointly adopted 
by IBM, Motorola and Apple. The 
collective experience of these compa- 
nies provided a firm foundation for 
making appropriate architectural 




trade-offs. The powerful new archi- 
tecture embraces fundamental con- 
cepts of simplicity and general appli- 
cability while extending itself to 
advanced techniques that will carry it 
into the next decade and beyond. 

There were several goals for the 
PowerPC Architecture. First, it was 
important that the architecture main- 
tain an application binary interface 
(ABI) compatible with IBM’s 
POWER architecture. This allows 
PowerPC microprocessor-based ma- 
chines to leverage the application 
base that exists for IBM’s RISC 
System/6000™ machines. To this end, 
the user-level instruction set and pro- 
gramming model of POWER was se- 
lected as a starting point for the 
PowerPC Architecture. Although 
some instructions were ultimately 
added and others deleted, these 
changes can effectively be managed 
by compilers and operating systems. 

The second goal was to simplify the 
architecture and ease unnecessary 
implementation requirements. This 
flexibility allows implementers to 
make optimizations that are appro- 
priate for specific market targets. In 
addition, the simplifications allow for 
smaller chip sizes, faster cycle times, 
and more aggressive superscalar 
implementations. 

A third objective of the PowerPC 
Architecture was to provide support 
for a wide range of uniprocessor and 
multiprocessor system configura- 
tions. This objective was achieved by 
recognizing key abstractions of the 
storage hierarchy and defining the 
storage control architecture to allow 
effective management of these ab- 
stractions. Furthermore, the architec- 


ture allows storage references to fol- 
low either a big-endian or a little- 
endian byte ordering convention. 
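
As a small illustration of the two byte-ordering conventions, the following Python sketch shows how one 32-bit word would be laid out in storage under each. It is not PowerPC-specific code, and the value 0x0A0B0C0D is an arbitrary example chosen here.

```python
import struct

word = 0x0A0B0C0D  # arbitrary 32-bit example value (an assumption for illustration)

# Big-endian: the most significant byte is stored at the lowest address.
big = struct.pack(">I", word)

# Little-endian: the least significant byte is stored at the lowest address.
little = struct.pack("<I", word)

print(big.hex())     # bytes in increasing address order, big-endian layout
print(little.hex())  # bytes in increasing address order, little-endian layout

# Reading the bytes back with the matching convention recovers the value.
assert struct.unpack(">I", big)[0] == word
assert struct.unpack("<I", little)[0] == word
```

The same four bytes appear in opposite order, which is why software and I/O devices must agree on the convention used for shared data.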

Finally, the PowerPC Architecture 
was defined with a set of 64-bit exten- 
sions that allow for upward compati- 
bility of 32-bit applications. This was 
achieved by defining the 64-bit in- 
struction operation as a logical exten- 
sion to the 32-bit execution model. 
The memory management architec- 
ture was also extended to allow ad- 
dress translation of 64-bit addresses. 
To allow flexibility, each implementa- 
tion can be compliant with either the 
base 32-bit PowerPC Architecture or 
the extended 64-bit architecture. 

The alliance partners agreed to an initial microprocessor roadmap (see
Figure 1) that specifies four independent design points. Each of these
designs is challenged with bringing industry leadership solutions to
its specific segment of the computing market. The initial four design
points include the following:


Figure 1. PowerPC roadmap [figure not reproduced; labels: High performance; Mainstream; Low power, low cost; Early support; years 1992, 1993, 1994]


The PowerPC 601™ microproces- 
sor is the first member of the family 
and is responsible for bringing 
PowerPC Architecture to the market 
as early as possible [2]. It is targeted 
for early PowerPC adopters and is 
suited for use in desktop computers, 
portable systems, and low-end multi- 
processor systems. The design imple- 
ments the 32-bit PowerPC Architec- 
ture and achieves competitive 
performance at a relatively low cost. 
Processor chips have been available since early 1993, and the first
PowerPC 601 microprocessor-based systems were introduced by IBM in the
fall of 1993. The PowerPC 601 is available
in 50-, 66-, 80-, and 100MHz versions. 

The PowerPC 603™ processor is 
defined to address the low-end and 
low-cost range of the 32-bit PowerPC 
microprocessor family [1]. This part 
packs desktop performance into an 
85 square millimeter die, and dissi- 
pates less than 3 watts of power at 
80MHz. The chip is intended for use 
in low-end desktop systems and port- 
able systems. This part was an- 
nounced formally in October 1993, 
and is slated for production in 3Q94. 

The PowerPC 604™ micropro- 
cessor is designed for mainstream 
computing environments including 
personal computers, midrange work- 
stations, and multiprocessor systems. 
It is organized to offer superior per- 
formance over the competition, and 
outstanding price/performance. The 
chip is a 32-bit implementation of the 
PowerPC Architecture, and uses ad- 
vanced superscalar design techniques 
to achieve high performance. Sam- 
ples are expected in the third quarter 
of 1994. 





The PowerPC 620™ processor is designed to deliver the maximum
performance achievable with the currently available half-micron CMOS
process technology. This superscalar design implements the full 64-bit
PowerPC Architecture and includes an embedded L2 cache controller
that interfaces to standard static random access memory (SRAM) chips.
The design is targeted at high-end desktop computers, work-group
servers and transaction processing-based systems. Samples are expected
in the second half of 1994.

These four microprocessors will be heavily used in future IBM and
Apple products, and they will also be independently marketed and sold
by both IBM and Motorola for use by other system developers. The
extension of the PowerPC microprocessor family to the open market is
an important and strategic part of the alliance’s plan to make PowerPC
Architecture successful.

Future Direction 

The four microprocessor designs identified on the current roadmap
represent only the start of PowerPC processor development. As
technology continues to advance, new opportunities for higher
performance, greater levels of integration, lower power consumption
and lower cost will arise. IBM and Motorola are committed to driving
these technology advancements into their processor development
efforts, and to extending the product roadmap to eventually cover an
even greater range of computing requirements.

In addition, over time, the existing 
design points will become usable as 
cores for more fully integrated and 
optimized solutions for particular tar- 
get applications or markets. These 
cores will also be available for custom- 
ers to use in development of their 
own application-specific VLSI solu- 
tions. Finally, relevant aspects of the 
PowerPC Architecture will be driven 
into other market segment opportu- 
nities. For example, in a separate ef- 
fort, IBM is developing a line of em- 
bedded controllers based on portions 
of the PowerPC Architecture. These 
controllers will be suitable for envi- 
ronments that are less demanding 
than the general-purpose computing 


environment, yet still require high performance at a very low cost.

References 

1. Kahle, J. and Ogden, D. The PowerPC 603 microprocessor. IBM RISC
System/6000 Technology: Vol. II, IBM Corp., 1993.

2. Moore, C.R. The PowerPC 601 microprocessor. IBM RISC System/6000
Technology: Vol. II, IBM Corp., 1993.

3. Silha, E. PowerPC Architecture: A high-performance architecture
with a history. IBM RISC System/6000 Technology: Vol. II, IBM Corp.,
1993.


In this document, the terms PowerPC 601 mi- 
croprocessor and 601; PowerPC 603 micropro- 
cessor and 603; PowerPC 604 microprocessor 
and 604; and PowerPC 620 microprocessor and 
620 are used to denote the various microproces- 
sors from the PowerPC Architecture family. 


Motorola is a trademark of Motorola, Incorpo- 
rated. 


Apple is a trademark of Apple Computer Corpo- 
ration. 


IBM, PowerPC, PowerPC Architecture, 
PowerPC 601, PowerPC 603, PowerPC 604, 
PowerPC 620, POWER, PowerOpen, AIX and 
POWER Architecture are trademarks of Inter- 
national Business Machines Corporation. 

About the Authors: 

CHARLES R. MOORE is senior engineer 
in the RISC System/6000 division of IBM 
Corp. He is currently working in the high- 
end processor development group. His 
research interests include high-perfor- 
mance computer organizations, VLSI de- 
sign methods and strategic planning. Au- 
thor’s Present Address: IBM Corp., 

11400 Burnet Rd, M/S 4305, Austin, TX 
78758-3493; email: cmoore@ibmoto.com 

RUSSELL C. STANPHILL is a Motorola 
veteran, currently serving as codirector of 
the Somerset Design Center. In 1991, 
Stanphill was instrumental in working 
with Apple and IBM to form what is now 
the PowerPC alliance. By early 1992 he 
was elected to manage the resulting Som- 
erset facilities. Author’s Present Address: 
Motorola, Semiconductor Products Sec- 
tor, Microprocessor and Memory Tech- 
nologies Group, 6501 William Cannon 
Drive West, Austin, TX 78735-8598. 


Permission to copy without fee all or part of this 
material is granted provided that the copies are not 
made or distributed for direct commercial advantage, 
the ACM copyright notice and the title of the publication and its date
appear, and notice is given that
copying is by permission of the Association for 
Computing Machinery. To copy otherwise, or to 
republish, requires a fee and/or specific permission. 


© ACM 0002-0782/94/0300 $3.50 




Keith Diefendorff 



History of the PowerPC Architecture

Approximately three years ago Apple, IBM, and Motorola decided to adopt a common 
RISC architecture on which to base their future hardware and software systems. The three 
companies believed the technical advantages of RISC were sufficiently compelling 
that if significant resources could be concentrated on one architecture, it might be 
possible to leverage these advantages into a serious market contender. Thus was born the 
PowerPC Architecture. 


Although confident of the signifi- 
cant advantages of RISC, they also 
realized that market success ulti- 
mately depends on software. To get a 
running start, it seemed sensible to 
begin with a RISC architecture that 
already had a large installed base of 
software. From that perspective, the 
obvious choice was IBM’s successful 
line of RISC System/6000 worksta- 
tions and servers, which are based on 
the company’s POWER architecture. 

The PowerPC Architecture reflects 
the work done by the team of com- 
puter architects from Apple, IBM, 
and Motorola who were tasked with 
the objective of retargeting the archi- 
tecture into a form more suitable for 
high-volume, single-chip micropro- 
cessors. The architects also enhanced 
the architecture with better multipro- 
cessor support features and extended 
it with a 64-bit address capability in 
order to ensure its viability into the 
next century. This article describes 
the evolution to the PowerPC Archi- 


tecture and briefly describes some of 
the changes incorporated relative to 
the POWER architecture. 

The fundamental concepts of 
RISC were developed by John Cocke 
in the mid-1970s at IBM’s T.J. Wat- 
son Research Center, and first em- 
bodied in a machine called the IBM 
801 minicomputer [5]. These ideas 
were further refined and articulated 
by a group at the University of Cali- 
fornia at Berkeley led by David Pat- 
terson, who coined the term “RISC” 
[4]. These early pioneers realized that RISC represented a
substantive departure from the then-popular trend toward more complex
instruction sets (embraced by “CISC” architectures such as the VAX,
8086, 32000, and 68000), a departure that promised higher
performance, lower cost, and shorter design time.

Complex instruction set architec- 
tures were primarily motivated by a 
desire to reduce the “semantic gap” 
between the machine language of the 


processor and the high-level lan- 
guages in which people were pro- 
gramming. The theory was that such 
a processor would have to execute 
fewer instructions (have a shorter 
path length) and would, therefore, 
have better performance. The key 
observation underlying RISC, how- 
ever, was that the sequential micro- 
code interpreter required to execute 
these complex instructions intro- 
duced an expensive overhead that 
actually slowed down execution of 
the more frequently occurring simple 
instructions — resulting in a net loss in 
performance. Furthermore, complex instructions proved to be a rather
poor target for compilers, which had difficulty using them, and in
many cases their use precluded optimizing out unnecessary operations.

With the declining cost of memory 
devices and improved compiler tech- 
nology, it became feasible to consider 
simplifying the instruction set, even 
at the cost of larger code size and 


higher memory bandwidth requirements. The 801 was the first machine
to implement this strategy. It successfully demonstrated that
simplifying the instruction set enabled implementations with smoother
running (bubble-free) pipelines that could approach the goal of
single-cycle instruction throughput. It was also discovered that other
architectural features now associated with RISC, such as a large
uniform register file, enabled compiler optimizations that actually
kept code expansion very low and even reduced data bandwidth relative
to existing CISC architectures. On balance, the 801 demonstrated that
investing more transistors in instruction throughput, fast cycle
times, and more registers produced a better solution to the computer
performance equation than was possible by spending those transistors
on more complex instructions.
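
The performance equation referred to above is usually written as execution time = path length x cycles per instruction x cycle time. The sketch below works through that arithmetic in Python. The numbers are hypothetical, chosen only to show how a longer path length can be outweighed by fewer cycles per instruction and a faster clock; they are not measurements of the 801 or of any CISC machine.

```python
def execution_time_ns(path_length, cycles_per_instruction, cycle_time_ns):
    """Execution time = instructions executed x cycles/instruction x ns/cycle."""
    return path_length * cycles_per_instruction * cycle_time_ns

# Hypothetical "complex instruction" design: shorter path length, but a
# microcode interpreter raises cycles per instruction and the cycle time.
cisc_ns = execution_time_ns(path_length=1_000_000,
                            cycles_per_instruction=4.0,
                            cycle_time_ns=20)

# Hypothetical "simple instruction" design: 30% more instructions, but a
# near single-cycle pipeline and a shorter cycle time.
risc_ns = execution_time_ns(path_length=1_300_000,
                            cycles_per_instruction=1.25,
                            cycle_time_ns=15)

print(f"complex-instruction design: {cisc_ns / 1e6:.1f} ms")
print(f"simple-instruction design:  {risc_ns / 1e6:.1f} ms")
```

Under these assumed figures the simpler design finishes sooner even though it executes more instructions, which is the trade-off the paragraph describes.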

But the ideas behind RISC turned out to be even more significant than
originally thought. Not only did RISC processors demonstrate more
parallelism through better pipelining, but the resulting
simplification of the hardware made tractable the idea of dispatching
multiple instructions simultaneously (superscalar¹) and enabled
implementation of the heretofore mainframe-domain concepts of dynamic
instruction reordering and out-of-order instruction execution on
single-chip microprocessors. This is the essence of RISC architecture:
it allows the execution of more operations in parallel and at a higher
rate than possible with a CISC architecture employing similar
implementation complexity.

Satisfied that the 801 concepts had 
made significant improvements in 
instruction cycle times and pipeline 
efficiency, IBM set out to improve 
further on the 801 architecture by: 
1) explicitly embodying the concept 
of superscalar operation in the archi- 
tecture; 2) improving the architec- 
ture as a target for compilers; 3) re- 
ducing instruction path lengths; and 
4) including floating point as a first- 
class data type in the architecture. 


¹The term “superscalar” is believed to have been coined by T. Agerwala
and John Cocke [1]. It refers to machines capable of dispatching
multiple instructions per clock from a conventional linear instruction
stream.


This effort culminated in the devel- 
opment of the POWER architecture 
[3] in the late 1980s, which now forms 
the basis of IBM’s RISC System/6000 
family of workstations and servers 
(see Figure 1). 

The POWER Architecture 

The POWER architecture is a conventional RISC architecture in most
respects; it adheres to the most important RISC tenets of fixed-length
instructions, register-to-register architecture, simple addressing
modes, simple (not requiring microcode interpretation) instructions,
a large register file, and a three-operand (nondestructive)
instruction format.
However, the POWER architecture 
also has several additional features 
that set it apart from other RISC 
architectures. 

Figure 1. PowerPC genealogy

Figure 2. POWER Architecture model

First, the instruction set was organized around the idea of
superscalar instruction dispatch. Conceptually, instructions are
dispatched across three independent execution units: a branch unit, a
fixed-point unit, and a floating-point unit (see Figure 2).
Instructions can be dispatched to each of these units simultaneously,
where they can execute concurrently and finish out of order. To
increase the level of instruction parallelism that can be achieved in
practice, the instruction set architecture defines an independent set
of register resources for each unit. This minimizes the communication
and synchronization required between units, thus allowing execution
units to adjust to the dynamic instruction mix by “slipping” past one
another. Any data communication required between units must be
performed explicitly, exposing it to the compiler, where it can be
effectively scheduled. (It is important to realize this is a
conceptual model only. Any given processor may implement each of the
conceptual units as multiple execution units to support additional
instruction parallelism. But the existence of the model led to the
consistent design of an instruction set that naturally supported at
least degree three parallelism.)
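
The conceptual three-unit model can be illustrated with a deliberately idealized Python sketch. It assumes each unit accepts one instruction per cycle and that the units never wait on one another (no dependencies, no cache misses), so it shows only the upper bound on overlap. The function names and instruction labels are hypothetical and are not taken from the POWER instruction set definition.

```python
from collections import Counter

def cycles_scalar(stream):
    """A scalar machine dispatches one instruction per cycle."""
    return len(stream)

def cycles_three_unit(stream):
    """Idealized model: the branch, fixed-point and floating-point units
    each take one instruction per cycle from their own queue and are free
    to slip past one another, so the busiest unit sets the total time."""
    per_unit = Counter(unit for unit, _label in stream)
    return max(per_unit.values(), default=0)

# Each entry is (conceptual unit, illustrative label).
stream = [
    ("fixed", "add"),
    ("float", "fmul"),
    ("branch", "bc"),
    ("fixed", "load"),
    ("float", "fadd"),
    ("fixed", "sub"),
]

print(cycles_scalar(stream), "cycles on a scalar machine")
print(cycles_three_unit(stream), "cycles in the idealized three-unit model")
```

Real implementations fall short of this bound whenever one unit needs a result from another, which is why the architecture makes such communication explicit and visible to the compiler.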

Comparison of PowerPC 601 and PowerPC 603 features

PowerPC 601 Microprocessor

Organization:
  POWER and PowerPC instruction sets
  Superscalar (degree 3)
  Out-of-order dispatch
  Three concurrent execution units
  Branch folding and branch prediction
  Fully pipelined floating point
  32KB, 8-way, copyback cache
  MESI cache coherency
  Dynamic load-store reordering
  64-bit, bursting, split-transaction bus

Implementation:
  0.6µm, 4-level metal, CMOS
  2.8 million transistors
  11 x 11 mm die size
  50, 66, and 80MHz versions
  3.6V, 6.5 watts @ 50MHz
  C4 in 304 pin QFP

Performance:
  IBM POWERstation 250 @ 66MHz:
    62.6 SPECint92
    72.2 SPECflt92
  Estimated @ 80MHz w/1MB L2:
    85 SPECint92
    105 SPECflt92

PowerPC 603 Microprocessor

Organization:
  PowerPC instruction sets
  Superscalar (degree 3)
  Out-of-order execution and completion
  Five concurrent execution units
  Register renaming
  Reservation stations
  Branch prediction and speculative execution
  Fully pipelined floating point
  8KB/8KB, 2-way, caches
  Dynamic load-store reordering
  64-bit, bursting, split-transaction bus

Implementation:
  0.5µm, 4-level metal, CMOS
  1.6 million transistors
  7.4 x 11.5 mm die size
  66 and 80MHz versions
  3.3V, 3 watts @ 80MHz
  240 pin CQFP

Performance:
  Estimated @ 80MHz w/1MB L2:
    75 SPECint92
    85 SPECflt92

Second, the POWER architecture added several “compound” instructions
to reduce instruction path
lengths. Perhaps the only drawback to 
RISC technology relative to CISC is 
that it sometimes takes more instruc- 
tions to perform a given task. IBM 
discovered that most of this code ex- 
pansion is avoidable with minor en- 
hancements to the instruction set that 
do not constitute a return to CISC- 
like complex instructions. For exam- 
ple, a large fraction of the code ex- 
pansion was found in the prolog and 
epilog code associated with saving 
and restoring registers across a pro- 
cedure call. To eliminate this as a fac- 
tor, IBM introduced “load-and-store 
multiple” instructions that allow sev- 
eral registers to be moved to or from 
memory with a single instruction. 
The linkage conventions used by the 
POWER compilers addressed the 
problems of relocation, shared librar- 
ies, and dynamic linkage in one sim- 
ple, unified mechanism. This is done 
by indirect addressing through a 
table of contents (TOC) that is up- 
dated at load time. The load-and- 
store multiple instructions were im- 
portant to these linkage conventions. 
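
To make the path-length argument concrete, the following Python sketch models a procedure prolog that saves a run of registers to memory, first with one store per register and then with a single store-multiple. The register numbering, the 4-byte slot size and the function names are assumptions made for illustration; the sketch does not reproduce the exact semantics of the POWER instructions.

```python
WORD = 4  # assumed bytes per saved register

def save_with_single_stores(regs, first, memory, base):
    """Save regs[first:] using one store instruction per register.
    Returns the number of instructions executed."""
    executed = 0
    for i in range(first, len(regs)):
        memory[base + WORD * (i - first)] = regs[i]
        executed += 1  # each register costs one store instruction
    return executed

def save_with_store_multiple(regs, first, memory, base):
    """Save regs[first:] using one store-multiple instruction.
    The loop models the work done inside that single instruction."""
    for i in range(first, len(regs)):
        memory[base + WORD * (i - first)] = regs[i]
    return 1  # the whole transfer counts as one instruction

regs = list(range(100, 132))  # 32 registers holding dummy values
mem_a, mem_b = {}, {}

n_single = save_with_single_stores(regs, 13, mem_a, 0x1000)
n_multiple = save_with_store_multiple(regs, 13, mem_b, 0x1000)

print(n_single, "instructions with single stores")
print(n_multiple, "instruction with store-multiple")
```

Both versions leave the same data in memory; only the number of instructions fetched and dispatched differs, which is where the code-size saving in prologs and epilogs comes from.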

Another example of “compound” 
instructions is the optional update of 
the base register on loads and stores 
with the newly calculated effective 
address. This instruction eliminates 
the need for the extra add instruction 
that would otherwise be required to 
increment the index for progressive 
indexing of arrays. Even though this 
is a compound operation, it does not 
adversely affect the conventional 
RISC pipeline flow because the up- 
dated address is already computed 
and a register file write port is nor- 
mally available while waiting on the 
memory operation. 
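
The effect of the update form on an array walk can be sketched as follows. This Python model counts instructions for stepping through an array with a plain load plus a separate add, and then with a load that writes the effective address back to the base register. The function names, the 4-byte stride and the memory layout are hypothetical choices made for illustration.

```python
def walk_with_plain_loads(memory, base, count, stride=4):
    """Each element needs a load and a separate add to advance the address."""
    addr, executed, values = base, 0, []
    for _ in range(count):
        values.append(memory[addr])
        executed += 1          # load
        addr += stride
        executed += 1          # explicit add to advance the index
    return values, executed

def walk_with_update_loads(memory, base, count, stride=4):
    """Load-with-update: the effective address (register + stride) is used
    for the access and also written back to the base register."""
    reg, executed, values = base - stride, 0, []  # pre-biased base register
    for _ in range(count):
        effective_address = reg + stride
        values.append(memory[effective_address])
        reg = effective_address  # base register updated as a side effect
        executed += 1            # one instruction does both jobs
    return values, executed

memory = {0x100 + 4 * i: i * i for i in range(8)}  # dummy array of squares

plain_values, plain_count = walk_with_plain_loads(memory, 0x100, 8)
update_values, update_count = walk_with_update_loads(memory, 0x100, 8)

print(plain_count, "instructions with plain loads")
print(update_count, "instructions with update-form loads")
```

Note the pre-biased starting address in the update version: because the address is advanced before each access, the base register starts one stride below the first element.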

The POWER architecture provided a few other path-length-reducing
instructions, such as an extensive set of bit-field manipulation
instructions, compound multiply-add floating-point instructions,
condition register setting as a side effect of normal instruction
execution, and load and store string instructions (which load or
store arbitrarily aligned byte strings).

A third factor that differentiates the POWER architecture from many
other RISC architectures is the absence of the branch-and-execute
capability. Branch-and-execute (sometimes called delayed branching)
causes the instruction following a branch to execute before the
branch gets taken. This feature worked effectively in early RISC
machines to fill the instruction bubble created by branch evaluation
and fetching the new instruction stream. However, in more advanced,
superscalar machines, this feature is ineffective because a single
branch delay cycle induces multiple instruction bubbles that cannot
all be covered with a single architectural delay slot. Almost all
such machines will implement exotic facilities (e.g., branch target
caches) for covering these bubbles. These facilities render the
delayed branch useless. Not only is the delayed branch ineffective in
such machines, it introduces significant complexity into the
instruction sequencing logic. Thus, even though the 801 employed
branch-and-execute, it was not included in the POWER architecture.
Instead the POWER branch architecture was organized to support
branch-lookahead and branch-folding techniques as described in the
next paragraph.

The branching technique used in the POWER architecture is the fourth
unique feature of the architecture compared to other RISC processors.
The POWER architecture defines an enhanced condition register facility.
The problem with traditional condition register architectures is that
the setting of condition bits as a side effect of instruction execution
poses serious limitations on the compiler’s ability to reschedule
instructions. Additionally, a condition register represents a single
architectural resource that causes a serious bottleneck in a machine
that executes multiple instructions in parallel or out of order. Some
RISC architectures avoid the problem by completely eliminating the
condition register and requiring conditions to be explicitly set (by a
compare instruction) in a general register and/or by folding the
comparisons into the branch instructions themselves. The latter
approach potentially overloads the branch-execute pipeline stage.
Therefore the POWER architecture chose instead to fix the problems of
the traditional condition register approach by: a) providing an opcode
bit in each instruction to make the condition register update optional,
thereby restoring the compiler’s ability to rearrange code, and
b) providing multiple condition registers (eight) to avoid the single
resource problem and to provide a large condition register namespace,
so the compiler can allocate and schedule condition register resources
as it does for general registers.
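
A minimal sketch of the two fixes, under stated assumptions: the class and function names are hypothetical, the four flags per field (less-than, greater-than, equal, overflow summary) follow the published PowerPC definition and are not spelled out in this article, and the choice of field 0 as the target of a recording arithmetic instruction is likewise taken from that definition.

```python
class ConditionRegister:
    """Eight independent condition fields, each holding (LT, GT, EQ, SO)."""

    def __init__(self):
        self.fields = [(False, False, False, False)] * 8

    def set_from_result(self, field, value, overflow=False):
        self.fields[field] = (value < 0, value > 0, value == 0, overflow)


def add(cr, a, b, record=False):
    """Add two integers; update condition field 0 only if the record bit is set."""
    result = a + b
    if record:
        cr.set_from_result(0, result)
    return result


def compare(cr, field, a, b):
    """Compare a with b and place the outcome in the chosen condition field."""
    cr.set_from_result(field, a - b)
```

Because a non-recording `add` leaves every field alone, a compiler is free to move it between a `compare` and the branch that consumes the comparison. Because `compare` can name any of the eight fields, several comparisons can be outstanding at once.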

Another reason for selecting the 
enhanced condition register model 
was that it is consistent with the orga- 
nization of the machine into indepen- 
dent execution units. Conceptually, 
the condition register is local to the 
branch unit. Consequently, it is not 
necessary to access the general regis- 
ter file (which is local to the fixed- 
point unit) to evaluate and execute a 
conditional branch. To the extent the 
compiler can schedule condition code 
updates early (and/or load the branch 
address registers early) the hardware 
can lookahead and fold-out resolv- 
able branches from the instruction 
stream. This avoids the instruction 
issue slot normally occupied by the 
branch instruction, and allows the 
instruction dispatcher to feed a con- 
tinuous linear stream of instructions 
to the computational execution units. 

Evolution of the PowerPC 
Architecture 

Satisfying the diverse needs of three major corporations like Apple,
IBM, and Motorola to meet their collective long-term vision of
computing required some modifications to the POWER architecture. So,
with the goal of maintaining RS/6000 software compatibility, a team of
architects from IBM, Apple, and Motorola set out to refine the
architecture. A number of changes were made to the architecture in the
following general categories:

• simplifying the architecture to be 
more appropriate for low-cost single- 
chip microprocessors 

• eliminating instructions that might 
impede clock rates 

• removing architecturally imposed 
barriers to superscalar dispatch and 
out-of-order execution 

• encouraging symmetric multipro- 
cessor systems by adding multipro- 
cessor support features 

• adding new features deemed nec- 
essary for anticipated applications 

• clearly defining the line between 
“architecture” and “implementation” 

• assuring a long lifetime for the ar- 
chitecture by extending it to a true 
64-bit architecture 

These changes resulted in a new ar- 
chitecture, officially called the 
PowerPC Architecture, which will 
form the basis for next-generation 
products not only from the three 
founding companies but from a large 
number of other companies as well. 

The PowerPC Architecture main- 
tains the same basic programming 
model and instruction opcode assign- 
ments as the POWER architecture. 
Where changes were made that could 
potentially prevent PowerPC proces- 
sors from running existing RS/6000 
binaries, care was taken to remove or 
change the feature in such a way that 
it could be trapped and emulated in 
software. To make this approach 
practical, features were changed only
if they were either used infrequently
in application code or were isolated
in library routines that could easily be
replaced.

Some of the more significant 
changes made in going from the 
POWER to the PowerPC architec- 
tures in the categories listed previ- 
ously include the following: 

• Simplification
- eliminated several bit-field instructions that used three source
operands (to avoid the need for an extra general register file port)
- redefined the real-time-clock as a simple binary counter with
variable count rate
- eliminated special segments and the associated fine-grain locking
- eliminated the most complex string instruction

• Higher clock rates
- eliminated four instructions whose operation was dependent on the
value of the source operand

• Removing superscalar barriers
- eliminated the MQ register and all extended precision shifts,
extended integer multiply, and the divide-with-remainder instructions
that used it
- added subtract without carry
- added floating-point imprecise exception modes

• Multiprocessor support
- added reservation model for atomically updating shared memory
- defined the weakly ordered storage model
- added new memory transaction ordering instructions
- redefined the user mode cache control instructions for use in
multiprocessor systems
- defined memory aliasing rules
- replaced two-level inverted page table with single hashed page table
capable of supporting concurrent aliasing

• New features
- added single-precision floating-point instructions
- added unsigned fixed-point multiply and divide
- added new storage attribute controls on memory pages
- added little-endian memory addressing mode
- added variable-size block address translation capability

• Extension to 64 bits
- defined superset architecture that supports full 64-bit linear
logical address space and 64-bit integer arithmetic
- defined segment table to provide an 80-bit virtual address space
(to replace the segment registers used in 32-bit addressing that
provided a 52-bit virtual address space)
- extended the page table formats to support a full 64-bit physical
address space
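
The 52-bit and 80-bit figures in the last group follow from simple field arithmetic, sketched below. The field widths are assumptions taken from the published architecture definition and do not appear in this article: a 28-bit offset within a segment, a 24-bit virtual segment identifier held in each of 16 segment registers for 32-bit addressing, and a 52-bit identifier supplied by the segment table for 64-bit addressing. The function names are hypothetical.

```python
SEGMENT_OFFSET_BITS = 28   # assumed: offset within one segment
VSID_BITS_32 = 24          # assumed: identifier width in a segment register
VSID_BITS_64 = 52          # assumed: identifier width in a segment table entry


def virtual_address_bits(vsid_bits):
    """Width of the virtual address formed from a segment id and an offset."""
    return vsid_bits + SEGMENT_OFFSET_BITS


def translate_32(ea, segment_registers):
    """Form a virtual address from a 32-bit effective address.

    The top 4 bits of the effective address pick one of 16 segment
    registers; the remaining 28 bits pass through unchanged.
    """
    sr_index = (ea >> SEGMENT_OFFSET_BITS) & 0xF
    offset = ea & ((1 << SEGMENT_OFFSET_BITS) - 1)
    vsid = segment_registers[sr_index] & ((1 << VSID_BITS_32) - 1)
    return (vsid << SEGMENT_OFFSET_BITS) | offset
```

With these widths, `virtual_address_bits(VSID_BITS_32)` gives 52 and `virtual_address_bits(VSID_BITS_64)` gives 80, matching the sizes quoted in the list.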


The most far-reaching change to 
the architecture was the extension to 
64 bits, which involved a number of 
modifications to the user program- 
ming model, instruction set, and ad- 
dress translation mechanisms. We 
defined the PowerPC Architecture as 
a full 64-bit architecture which has a 
32-bit subset. The architecture per- 
mits both 32- and 64-bit versions of 
PowerPC processors, but all processors
are required to support 32-bit
programs as a minimum. The architecture
defines a 32/64-bit mode
switch controllable from supervi- 
sor code that allows a 64-bit proces- 
sor implementation to run 32-bit 
programs. 

The primary change to the user- 
visible architecture was to extend the 
width of the general registers and the 
branch address register to 64 bits. On 
processors that implement the full 
64-bit architecture, all instructions 
now simply work on full 64-bit regis- 
ters rather than 32-bit wide registers. 
Nearly all instructions are entirely 
mode-independent. The only signifi- 
cant effect of the mode switch is to se- 
lect how much of a 64-bit effective 
address is used in address translation. 
(There are a few other minor effects 
such as selecting the ALU bit from 
which ALU conditions such as carry 
and overflow are generated.) The 
address translation mechanism for 
64-bit processors is similar to the 
translation mechanism used on 32-bit 
implementations, except that the seg- 
ment registers are replaced with a 
segment table to handle the larger 
logical and virtual address spaces, 
and the page table format was extended
to accommodate the larger
virtual and physical address spaces.
But these translation changes are not
apparent to application software.
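
The two effects of the mode switch described above can be sketched in a few lines. This is an illustrative model only: the function names are hypothetical, and the treatment of 32-bit mode as "keep the low 32 bits of the address, take the carry out of the low 32 bits" is this sketch's reading of the paragraph, not a statement of the architecture's exact rules.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def effective_address(base, displacement, mode64):
    """Compute base + displacement in a 64-bit register, then apply the mode."""
    ea = (base + displacement) & MASK64      # the adder is always 64 bits wide
    return ea if mode64 else ea & MASK32     # 32-bit mode uses only the low half


def add_carry(a, b, mode64):
    """Return (64-bit sum, carry) with the carry taken from the mode's width."""
    width = 64 if mode64 else 32
    mask = (1 << width) - 1
    carry = ((a & mask) + (b & mask)) >> width
    return (a + b) & MASK64, carry
```

In both modes the register result is the same 64-bit sum. Only the address used for translation and the bit position that produces the carry differ, which is why nearly all instructions can be mode-independent.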

The PowerPC 601 

The first PowerPC microprocessor, 
the PowerPC 601 microprocessor, is 
now available from both IBM and 
Motorola. The 601 is a medium- 
sized, medium-performance proces- 
sor suitable for low- to medium-cost 
desktop computer systems. It was 
designed as a transition processor 
from the POWER architecture to the 
PowerPC Architecture. Thus it imple- 
ments a superset of both POWER and 
PowerPC features so that existing
RS/6000 binaries will run at full speed.
This provides additional time for the 
compilers to be retargeted to the 
PowerPC Architecture and applica- 
tions to be recompiled before the pro- 
cessors which implement strictly the 
PowerPC Architecture become widely 
available. 

The 601 was based on an IBM single-chip processor that was being
designed when the alliance was first formed. But the 601 underwent
major enhancements to improve performance and reduce costs (see
sidebar “A Comparison of PowerPC 601 and PowerPC 603 Features”). For
example, a more sophisticated branch unit was included, along with
multiprocessor features such as Motorola’s 88110 high-performance
microprocessor bus interface. The 601 implements a moderately
aggressive superscalar microarchitecture capable of dispatching
3 instructions, possibly out-of-order, on each clock cycle.

Other PowerPC Processors 

IBM and Motorola, with Apple engi- 
neering participation, have put into 
operation a new design center to de- 
velop future PowerPC microproces- 
sors. The Somerset Design Center is a 
37,000 square-foot facility located in 
Austin, Texas, staffed primarily by 
Motorola and IBM with approxi- 
mately 300 engineering profession- 
als. The design center is presently 
working concurrently on three sepa- 
rate PowerPC microprocessors. The 
three parts currently in development 
in the design center include: 


• The 603: a processor intended primarily for the cost-sensitive,
desktop and portable personal computer systems

• The 604: a high-performance part for uniprocessor or multiprocessor
desktop personal computers and workstations

• The 620: a 64-bit high-performance part for high-end workstations,
servers, and multiprocessor systems

Engineers at the Somerset Design Center employ a formal VLSI design
methodology derived from the best of both IBM’s and Motorola’s CAD
tools. The new designs all use an advanced 0.5µm CMOS technology
using a common set of design rules for both IBM and Motorola
semiconductor fabrication facilities.

The first processor designed completely in the new design center was
the 603, which is now sampling. The 603 employs a slightly more
aggressive microarchitecture than the 601, but has a smaller cache
giving it approximately the same performance as the 601 at lower cost
(see sidebar “A Comparison of PowerPC 601 and PowerPC 603 Features”).
The 603 is also superscalar, capable of dispatching 3 instructions per
clock, in-order, into 5 concurrent execution units. It employs
register renaming, reservation stations, speculative execution, and
out-of-order instruction execution and completion to boost instruction
parallelism. The 603 operates at 3.3V and utilizes static design,
automatic power-down circuitry, and a number of software-controlled
power-saving modes to make it useful in laptop systems as well as
low-cost desktop systems.

The 604 and 620 products had not yet been officially announced when
this article was written, but are scheduled to see silicon during
1994. In addition to these four processors, other new PowerPC
processors are in development at both Motorola and IBM internal design
centers. These designs will target specific markets ranging from very
low-cost, high-volume embedded control markets, to very low-power
subnotebook computer markets, all the way to very high-end computers.
Also, research is under way into advanced microarchitectural
techniques for the next generation of billion-instruction-per-second
class microprocessors.

Summary 

The PowerPC Architecture is the 
product of nearly 20 years of work on 
RISC architectures beginning with 
IBM’s seminal 801 minicomputer in 
the 1970s and finally refined by Ap- 
ple’s experience in advanced per- 
sonal computers and by Motorola’s 
experience in delivering low-cost 
single-chip microprocessors into 
high-volume markets. This architec- 
ture is currently being used as the 
base for a wide variety of instruction- 
set compatible microprocessors. With 
their combined resources, IBM, 
Apple, and Motorola intend to de- 
liver an unparalleled range of 
PowerPC RISC processors into the 
market, all the way from the very low- 
est end of the portable computer 
marketplace to the very highest end 
of the supercomputer market. □ 

References 

1. Agerwala, T. and Cocke, J. High performance reduced instruction
set processors. IBM Tech. Rep., March 1987.

2. Gullette, B. The design of the 88110 Bus Interface. In Proceedings
of RISC ’92, Feb. 1992.

3. Oehler, R.R. and Groves, R.D. IBM RISC System/6000 processor
architecture. IBM J. Res. Develop. 34 (Jan. 1990), 23-36.

4. Patterson, D.A. and Ditzel, D.R. The case for the reduced
instruction set computer. Comput. Architecture News (Oct. 15, 1980).

5. Radin, G. The 801 Minicomputer. In Proceedings of the Symposium on
Architectural Support for Programming Languages (March 1982), 39-47.

About the Author: 

KEITH DIEFENDORFF is a micropro- 
cessor architect and member of the techni- 
cal staff in Motorola’s RISC microproces- 
sor division. Current research interests 
include advanced architectural concepts 
for future high-performance microproces- 
sors. Author’s Present Address: Moto- 
rola, Inc., 6501 William Cannon Drive 
West, Austin, TX 78735-8598; email: 
keith_diefendorff@email.sps.mot.com 

IBM, PowerPC, PowerPC 603, PowerPC Archi- 
tecture, and RISC System/6000 are trademarks 
of International Business Machines Corpora- 
tion. 


Permission to copy without fee all or part of this 
material is granted provided that the copies are 
not made or distributed for direct commercial 
advantage, the ACM copyright notice and the 
title of the publication and its date appear, and 
notice is given that copying is by permission of 
the Association for Computing Machinery. To 
copy otherwise, or to republish, requires a fee 
and/or specific permission. 

© ACM 0002-0782/94/0600 $3.50 




COMMUNICATIONS OF THE ACM June 1994/Vol.37, No.6 3 3 


Brad Burgess, Nasr Ullah, Peter Van Overen, and Deene Ogden 



The PowerPC 603 Microprocessor 

In October 1993, Motorola and IBM unveiled the first low-power version of the PowerPC
family, the PowerPC 603 microprocessor. Measuring a mere 85mm² (7.4 x 11.5 mm) in
size, the 603 contains 1.6 million transistors and consumes less than 3 watts of power
when operating at 80MHz. With estimated performance values of 75 SPECint92 and 85
SPECfp92, the 603 is comparable in performance to present-day high-end personal
computer and workstation processors. This member of the PowerPC family is designed
to bring high-performance and low-power capabilities to the laptop and low-cost desktop
computer markets.


Following closely on the heels of its 
predecessor, the PowerPC 601 micro- 
processor [1], the 603 microprocessor 
was developed at the joint Motorola/ 
IBM/Apple Somerset Design Center 
in Austin, Texas. The 603 microarchi- 
tecture evolved from Apple, IBM, 
and Motorola’s collective experience 
on several past designs. The similar- 
ity of the POWER and PowerPC ar- 
chitectures permitted the use of sam- 
ple traces generated by RISC 
System/6000 machines for evaluation 
of design trade-offs. The compiler 
groups also provided their insight to 
ensure the traces from the past gen- 
eration of processors and compilers, 
with their own specific peculiarities, 
did not misguide the 603’s microar- 
chitecture definition, and that trade- 
offs selected were appropriate for the 
next generation of compilers. 

To accelerate the design and test
process, engineers employed a formal
VLSI design methodology derived
from the best of both IBM and Motorola’s
CAD tools. These tools enable
both the rapid design and dense 
packing capability necessary to pro- 
duce very high-volume, high-yield 
microprocessors for the commercial 
market. The 603 design team em- 
ployed a combination of custom cir- 
cuitry (for arrays), library compo- 
nents (for data paths), and standard 
cell place and route (for random 
logic) to accomplish the 603 design. 

Using the best tools and methodol- 
ogy available, the design team took 
the 603 from concept to working sili- 
con in 18 months. Ongoing design 
evaluation and debugging, including 
simulation of 28 billion processor cy- 
cles prior to tape-out, provided fully 
functional first-pass silicon that ran at 
the design target speed of 80MHz. 

The PowerPC 603 microprocessor is manufactured by Motorola in Austin,
Texas, and by IBM in Burlington, Vt. Motorola and IBM both fabricate
the 603 using a 0.5µm, 4-level metal, 3.3VDC CMOS process with design
rules compatible with both companies’ semiconductor processes. The
die is designed to be packaged in either a 240-pin ceramic quad flat
pack or a ball-grid array package. Figure 1 is a photograph of the
603 die.

Functional Overview 

The 603 is the first processor in the PowerPC family to fully support
the PowerPC Architecture. It incorporates five execution units:
branch, integer, floating-point, load/store, and system register; and
a pair of on-chip 8KB instruction and data caches. Since the 603 is a
superscalar microprocessor, it is capable of issuing and retiring as
many as three instructions per clock to these execution units. For
increased performance, the 603 allows instructions to be executed
out-of-order. Additionally, the 603 provides programmable power
reduction modes that permit systems designers the flexibility of
implementing a variety of power management techniques. A block diagram
of the 603 is shown in Figure 2.

Figure 1. PowerPC 603 microprocessor die photograph

[Block diagram labels: branch processing unit; SRs, IBAT, and ITLB
array; SRs, DBAT, and DTLB array; 8-Kbyte I cache and 8-Kbyte D cache
with tags; touch load buffer; copyback buffer; processor bus
interface; address and data buses; power dissipation control; time
base counter/decrementer; JTAG/COP interface; clock multiplier]

Instructions are dispatched in-order to one of the five execution
units. If there are no operand dependencies, execution occurs
immediately. The integer unit executes most instructions in one cycle.
The floating-point unit is pipelined and executes both single and
double precision floating-point operations. Branch resolution is
handled by the branch unit. If the branch conditions are available,
branches are immediately resolved; otherwise, instruction execution
continues speculatively. Instructions that modify the processor
control registers are executed by the system register unit. Finally,
data movement between the data cache and the general-purpose and
floating-point registers is handled by the load/store unit.

In case of cache misses, the caches access main memory through a
64-bit high-performance bus similar to that of the MC88110 [8]. To
maximize throughput and thus increase overall performance, the cache
communicates with memory mostly via burst operations that allow a
cache line to be filled in one transaction.

After an instruction finishes execution in an execution unit, its
results are forwarded to a completion buffer, and then subsequently
written to the appropriate register file set when the instruction is
retired from the completion buffer. To avoid register contention, the
603 provides separate 32-entry integer general-purpose register (GPR)
and floating-point register (FPR) sets for the storage of operands.

The following sections discuss in more detail the factors that
contribute to the efficient flow of instructions and data through
the 603.

Figure 2. Block diagram of the PowerPC 603 microprocessor

Instruction Pipeline 

Figure 3 shows the 603’s instruction 
pipelines for several types of instruc- 
tions. 

Fetch stage. During this stage, the 
instruction fetcher retrieves two in- 
structions at a time from the instruc- 
tion cache (regardless of alignment), 
unless the address points to the last 
word of a cache line, in which case 
only a single word is returned. 
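
The fetch rule can be written down as a short sketch. The 32-byte line size comes from the 603 feature list; the 4-byte instruction size, the absence of cache misses, and an instruction queue that always has room are assumptions of this model, and the function names are hypothetical.

```python
LINE_BYTES = 32   # cache line size, from the 603 feature list
WORD_BYTES = 4    # assumed: one instruction per 32-bit word


def instructions_fetched(address):
    """Two instructions per fetch, or one if the fetch starts at a line's last word."""
    last_word = LINE_BYTES - WORD_BYTES
    return 1 if address % LINE_BYTES == last_word else 2


def fetch_cycles(start, count):
    """Fetch cycles needed to bring in `count` sequential instructions."""
    cycles, address, remaining = 0, start, count
    while remaining > 0:
        got = instructions_fetched(address)
        address += got * WORD_BYTES
        remaining -= got
        cycles += 1
    return cycles
```

For example, eight sequential instructions starting on a doubleword boundary take four fetches in this model, while the same eight starting one word later take five, because one fetch lands on the last word of a line.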


Decode/Source stage. During this stage, the dispatcher and branch unit
decode instructions, allocate rename registers, read available source
operands, and dispatch instructions to their respective execution
units (or reservation stations).

Execute stage. During this stage, the execution units execute
instructions and write results back to the destination rename
registers. If the data is needed as a source operand for another
instruction, the data is forwarded immediately to the requesting unit.
When the execution unit finishes with an instruction it notifies the
completion buffer that the instruction is finished and tags the
instruction if any exceptions occurred.

Completion stage. During this final stage, the completion buffer logic
writes the contents of any renamed registers into the architectural
registers. It then deallocates rename registers and returns them to
the pool for future use. If the completion logic detects that an
instruction is tagged as having caused an exception, it flushes the
pipeline and initiates exception processing. Otherwise, it retires
completed instructions and removes them from the completion buffer.
Because the completion logic retires all instructions in program
order, all exceptions are fully precise.

Figure 3. PowerPC 603 microprocessor master instruction pipeline

Functional Units

Dispatch unit. The instruction flow diagram is shown in Figure 4.
Instructions are fetched two per cycle from the instruction cache, and
placed in a six-entry instruction queue. On arrival at the instruction
queue, branch instructions are immediately forwarded to the branch
processing unit for resolution. All other instructions are issued from
the instruction queue at the rate of two per cycle if there is no
resource contention (for example, a busy execution unit, or a full
completion queue).

The instruction dispatcher de- 
codes the bottom two entries of the 
queue and dispatches up to two in- 
structions per cycle, in program 
order, to either the integer unit, 
floating-point unit, load/store unit, or 
system unit. If the instruction dis- 
patcher finds an execution unit busy, 
it does not dispatch the instruction 
and stalls. There are several mecha- 
nisms in the 603 to avoid dispatch 
stalls. 


[...] operand dependencies, the 603 has
single-entry zero-latency reservation 
stations associated with each execu- 
tion unit. The reservation station 
holds the instruction until all oper- 
ands are available. This allows the 
dispatcher to dispatch subsequent 
instructions to other execution units 
without stalling the instruction 
queue. 

In addition to dispatching instruc- 
tions, the dispatcher allocates rename 
buffers and coordinates pipeline 
stalls. Rename buffers provide tem- 
porary storage for the results of an 
instruction’s execution. Register re- 
naming helps avoid stalls on register 
write-after-write and write-after-read 
hazards. In addition, the rename 
buffers simplify exception recovery 
by allowing the 603 to invalidate re- 
sults of speculative instruction execu- 
tion without affecting the contents 
of the general registers, floating- 
point registers, and processor control 
registers. 
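
The role of the rename buffers can be illustrated with a small model. Everything here is a simplification made for the sketch: the class and method names are hypothetical, the number of buffers is unbounded, and only one register file is modeled.

```python
class RenameFile:
    """Architectural registers plus an ordered list of pending rename buffers."""

    def __init__(self, num_regs=32):
        self.arch = [0] * num_regs   # architectural register file
        self.pending = []            # (reg, value, speculative), program order

    def write(self, reg, value, speculative=False):
        """An execution result goes to a rename buffer, not the register file."""
        self.pending.append((reg, value, speculative))

    def read(self, reg):
        """Forward the newest pending value, else read the register file."""
        for r, value, _ in reversed(self.pending):
            if r == reg:
                return value
        return self.arch[reg]

    def flush_speculative(self):
        """Discard speculative results; architectural state is untouched."""
        self.pending = [p for p in self.pending if not p[2]]

    def retire_oldest(self):
        """Completion: copy the oldest rename buffer into the register file."""
        reg, value, _ = self.pending.pop(0)
        self.arch[reg] = value
```

Two writes to the same register occupy two different buffers, so the second need not wait for the first (no write-after-write stall), and discarding a speculative result is just a matter of dropping its buffer.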

When the instruction queue is rela- 
tively full, the branch unit decodes 
instructions upstream from those 
being decoded by the dispatcher. For 
taken branches, the instruction queue 
usually contains enough instructions 
to keep the dispatcher busy until the 
new instruction stream is fetched. 
Thus, from the dispatcher’s perspec- 
tive this allows basic blocks to be con- 
nected without ever seeing the 
branch instruction or the fetch penalty.
Once an instruction is dispatched, the dispatcher
transfers control of the instruction’s
execution to the execution unit, and
control of the overall instruction
stream to the completion buffer logic.

Execution units. The branch unit
decodes the fetched instructions and 
executes most branch instructions in 
a single clock. It also retires and elim- 
inates from the instruction queue 
(branch folding) branches that do not 
modify branch-related resources such 
as the Link or Count registers. The 
Link register contains the return 
address from a branch instruction. 
The Count register is a loop counter 
that is used by some branch instruc- 
tions. 

The branch unit has its own facilities for calculating the branch
target address. Conditional branches in the PowerPC Architecture
depend on a counter-register (CTR) value and/or any one of 32
condition register (CR) bits.

If the CTR value is unavailable, the
branch unit and instruction fetching
are stalled (not likely to occur very
often). If a CR bit is unavailable, the
branch unit predicts either the taken 
or fallthrough path based on bits in 
the branch opcode. These predicted 
instructions are tagged as speculative 
and proceed down the pipeline nor- 
mally. The unavailable CR bit that 
initiated speculative execution is 
checked each cycle. When the CR bit 
becomes available and the branch re- 
solves, the branch unit flushes all 
speculative tagged instructions from 
the pipeline if the branch was mispre- 
dicted, or simply clears all speculative 
tags if the branch was correctly pre- 
dicted. The branch unit can only re- 
solve a single CR bit at a time, thus it 
can only speculate down one condi- 
tional branch path at a time. 
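
The tag-and-resolve behavior can be summarized in a few lines. This is a deliberately small sketch with a hypothetical function name: it models one outstanding conditional branch, takes the static hint as a boolean, and treats the instructions fetched down the predicted path as an opaque list.

```python
def resolve_branch(hint_taken, actual_taken, fetched_after_branch):
    """Return (instructions kept, instructions flushed) once the CR bit resolves.

    While the CR bit is unknown, everything fetched down the predicted
    path is tagged speculative. Resolution either clears the tags or
    flushes the tagged instructions.
    """
    speculative = list(fetched_after_branch)
    if hint_taken == actual_taken:
        return speculative, []     # correct prediction: clear the tags
    return [], speculative         # misprediction: flush the tagged work
```

Because only one CR bit is tracked at a time, a second unresolved conditional branch would have to wait in this model rather than start another speculative path.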

The integer unit processes integer arithmetic, logical, and bit-field
instructions. All integer instructions are single cycle with the
exception of multiply and divide, which require 2 to 6
(data-dependent) and 37 processor clock cycles, respectively.

The load/store unit handles load and 
store instructions to and from both 
the integer and floating-point regis- 
ters. It contains a dedicated adder for 
the calculation of effective addresses, 
and the logic required for data align- 
ment to and from the cache. 

The load/store unit is fully pipe- 
lined so that loads can be dispatched 
at the rate of one per clock cycle. 
Loads have a two-clock-cycle latency:
a half cycle to compute the effective
address, one cycle to access the data
cache and MMU, and another half
cycle to write the result into the
rename register.

Since the load/store unit cannot 
write to the cache until after checking 
for memory protection violations, it 
does not execute store instructions in 
a fully pipelined manner. During the 
execution stage, the load/store unit 
calculates the effective address and 
translates the address to check for 
memory protection violations. On the 
next clock cycle, the store advances to 
a holding buffer, where it waits for 
the completion logic to retire the
instruction and enable writing data to
the cache.

The 603 takes advantage of the 
PowerPC Architecture’s weakly or- 
dered memory model and allows load 
instructions to bypass pending stores 
in order to minimize stalls due to load 
data hazards. (Of course, loads which 
may potentially access the same loca- 
tion as the pending store are not al- 
lowed to bypass in order to preserve 
correct program operation.) 
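
The bypass condition amounts to an overlap test between the load and each pending store. The sketch below assumes full byte-range comparison and a hypothetical function name; real hardware may compare fewer address bits and err on the side of making the load wait.

```python
def can_bypass(load_addr, load_size, pending_stores):
    """A load may pass pending stores only if it overlaps none of them.

    pending_stores is a list of (address, size) pairs in bytes.
    """
    load_end = load_addr + load_size
    for store_addr, store_size in pending_stores:
        if load_addr < store_addr + store_size and store_addr < load_end:
            return False       # possible alias: the load must wait
    return True
```

A load from an unrelated address goes ahead immediately, while a load that touches any byte of a pending store is held until that store has been written.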

The floating-point unit is fully pipelined
with single-cycle throughput
and three-clock-cycle latency for all 
single-precision instructions (except 
divide) and double-precision adds, 
subtracts, and compares. The first 
stage of the pipeline performs oper- 
and alignment and multiplication, 
the second stage performs addition, 
and the third stage rounds and nor- 
malizes the result. For double-preci- 
sion multiplies and multiply-adds, 
the first pipeline stage is iterated 
twice, providing two-clock-cycle 
throughput and four-clock-cycle la- 
tency. Divides are nonpipelined and 
stall the floating-point unit for 18 
clock cycles for single-precision oper- 
ations and 33 clock cycles for double- 
precision operations. 
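
These figures translate directly into cycle counts for a run of independent operations: the first result appears after the latency, and each further result after one more issue interval. The sketch below encodes the numbers from the paragraph above; the table keys and function name are hypothetical, and the model assumes no operand dependencies or other stalls.

```python
# (latency, issue interval) in clock cycles, taken from the text above.
FP_TIMING = {
    "single":     (3, 1),    # single-precision, except divide
    "double_add": (3, 1),    # double-precision add, subtract, compare
    "double_mul": (4, 2),    # double-precision multiply and multiply-add
    "single_div": (18, 18),  # nonpipelined
    "double_div": (33, 33),  # nonpipelined
}


def cycles_for(op, n):
    """Cycles to finish n independent operations of one kind."""
    if n <= 0:
        return 0
    latency, interval = FP_TIMING[op]
    return latency + (n - 1) * interval
```

Ten independent single-precision operations finish in 12 cycles under this model, while ten double-precision multiplies need 22, which shows the cost of iterating the first pipeline stage twice.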

The floating-point unit supports 
denormalized numbers in hardware. 
Since denormalized results require 
more time to process, a special non- 
IEEE mode provides a means to co- 
erce denormalized results to zero. 
This eliminates data-dependent in- 
struction execution time that would 
disrupt real-time data processing. 
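
The effect of such a mode on a result can be sketched for double-precision values. This is an illustration of the general flush-to-zero idea, not of the 603's exact behavior: the function name is hypothetical, and keeping the sign of the flushed value is a choice made for this sketch.

```python
import math
import sys

MIN_NORMAL = sys.float_info.min   # smallest positive normalized double


def flush_to_zero(x):
    """Return x, or a signed zero when x is a denormalized double."""
    if x != 0.0 and abs(x) < MIN_NORMAL:
        return math.copysign(0.0, x)
    return x
```

Normalized values, zeros, infinities, and NaNs pass through unchanged; only values too small to be represented with a normalized mantissa are replaced.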

The primary function of the system 
register unit is to process the condi- 
tion-register-logical instructions and 
the move operations to or from the 
SPRs. These instructions execute in 
one to three cycles, and are serializ- 
ing in nature (that is, all preceding 
instructions must have been com- 
pleted and retired). As a result they 
have somewhat less performance 
than might otherwise be expected. 
However, given the relatively infre- 
quent use of instructions executed by 
the system register unit, the trade-off 
was made in favor of reduced com- 
plexity and silicon costs. 

Completion buffer. The completion
buffer tracks instruction execution,
retires finished instructions, and controls
writing of the contents of renamed
registers to the architectural
registers. The completion buffer
tracks up to five instructions at a time
and can retire up to two instructions
each clock cycle. Because instructions
may execute out of order, the completion
buffer provides an ordering
mechanism that makes instruction
completion appear sequential, and
provides a mechanism for precise
exception handling for PowerPC 603
microprocessor systems.

Figure. PowerPC 603 microprocessor
instruction flow

603 Features

Superscalar instruction processor:

• 2 instructions fetched per clock
• 5 independent execution units (branch, integer, floating-point, load/store, and system)
• 3 instructions issued per clock (2 dispatched plus 1 branch)
• 2 instructions retired per clock plus branch folding
• 32, 32-bit general registers and 32, 64-bit floating-point registers
• 3-stage pipelined floating-point unit (floating-point denorm support in hardware)
• 2-stage pipelined load-store unit (loads bypass stores; single entry store buffer)
• Static branch prediction and speculative execution

Memory System:

• 8K, 2-way set associative, 32-bytes/line, instruction cache
• 8K, 2-way set associative, 32-bytes/line, copyback data cache
• Support for big-endian and little-endian addressing
• 64-entry, 2-way set associative, instruction TLB
• 64-entry, 2-way set associative, data TLB (hardware-assisted software table walk for TLB reloads)
  • 4 instruction block address translation registers (IBAT)
  • 4 data block address translation registers (DBAT)
  • 16 segment registers
• 64-bit, pipelined, split transaction, burstable external bus with parity
  • 32-bit data bus option
  • internal clock multiplier 1x, 2x, 3x, 4x

System Features:

• Power Down modes (DOZE, NAP, SLEEP)
• JTAG interface
• On-chip hardware debug support

Memory Subsystem 

The memory subsystem provides the 
instructions for the instruction 
fetcher and data for the load/store 
unit. Efficient access between the 
caches and memory systems is pro- 
vided by the MMU, and the external 
bus interface. 

Caches and Memory Management 

The 603 incorporates two 8KB,
2-way set associative, 32-byte per line
on-chip caches, one for instructions 
and one for data. On a cache hit, the 
instruction cache can provide 2 in- 
structions per cycle to the instruction 
queue, and the data cache can pro- 
vide up to a double-word of data to 
the load/store unit per cycle. 

The data cache supports copy-back 
or write-through policies. Both 
caches use a least recently used (LRU) 
replacement policy. On a cache miss, 
the 603’s cache blocks are filled in 
four beats of 64 bits each. The burst 
fill is performed as a ‘critical-double 
word-first’ operation; the critical dou- 
ble word is simultaneously written to 
the cache and forwarded to the re- 
questing unit, thus minimizing stalls 
due to cache fill latency. In the case of 
the data cache, the burst operation is 
also used to write-back a modified 
cache line to memory. 
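
The cache geometry described above implies 128 sets per cache and four 64-bit beats per line. The sketch below shows one way to split an address and to order the beats of a critical-double-word-first fill. It assumes the conventional mapping of low-order address bits to offset and set index, and it assumes the burst wraps sequentially after the critical double word; the article states neither detail, and the function names are illustrative.

```python
LINE_BYTES = 32
WAYS = 2
CACHE_BYTES = 8 * 1024
SETS = CACHE_BYTES // (LINE_BYTES * WAYS)    # 128 sets
BEAT_BYTES = 8                               # one 64-bit beat
BEATS_PER_LINE = LINE_BYTES // BEAT_BYTES    # 4 beats per line fill


def split_address(addr: int) -> tuple[int, int, int]:
    """Return (tag, set index, byte offset) for an address."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % SETS
    tag = addr // (LINE_BYTES * SETS)
    return tag, index, offset


def burst_order(addr: int) -> list[int]:
    """Double-word numbers (0-3) in the order they are filled."""
    critical = (addr % LINE_BYTES) // BEAT_BYTES
    # Assumption: after the critical double word the burst wraps around
    # the line sequentially.
    return [(critical + i) % BEATS_PER_LINE for i in range(BEATS_PER_LINE)]
```

For a miss at address 0x1234 this gives tag 1, set 17, offset 20, and a fill order of double words 2, 3, 0, 1, so the requesting unit receives its data on the first beat.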

Since the 603 was not specifically 
targeted for multiprocessing applications,
the 603 design restricted certain
aspects of data cache coherency
in order to save silicon area. Although
the data cache only supports a three- 
state MEI (modified, exclusive, in- 
valid) cache coherency protocol, it is 
compatible with full MESI (modified, 
exclusive, shared, invalid) caches on 
the same bus. Additionally, the data 
cache implements only a single set of 
cache tags which must be arbitrated 
for between snooping operations and 
load/store activity, with priority given 
to snoop operations. 

The load/store unit provides the 
data transfer interface between the 
data cache and the GPRs and the 
FPRs. In addition the load/store unit 
provides all logic required to calcu- 
late effective addresses, handles data 
alignment to and from the data 
cache, and provides sequencing for 
load-and-store string and multiple 
operations. The caches provide a 64-bit
interface to the instruction fetcher
and load/store unit. 

For faster translation of addresses, 
the 603 provides two four-entry, fully 
associative block address translation 
registers, and two 64-entry, 2-way set 
associative translation look-aside 
buffers (TLBs) for instructions and 
data. The 603 uses software to per- 
form TLB replacements. A hash 
function is used for the replacement 
policy. When a TLB miss occurs, the 
processor takes a special exception to 
the software tablewalk handler. This 
handler walks through the page ta- 
bles to locate and reload the neces- 
sary page table entry into the TLB. 
Additionally, the 603 has dedicated 
scratch pad registers that can be used 
to shadow general-purpose registers 
during software tablewalks. Through 
hardware assist, the entire tablewalk 
routine fits in two cache lines of mem- 
ory and provides a low-cost, flexible, 
and fairly fast TLB reload capability. 
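
The interaction between a 64-entry, 2-way set associative TLB and a software reload handler can be sketched as follows. This is a simplified model under several assumptions that are not in the article: 4KB pages, set selection by the low bits of the virtual page number, eviction of the older entry in a full set, and a Python dictionary standing in for the page tables. The class and attribute names are hypothetical.

```python
PAGE_BYTES = 4096        # assumption: 4KB pages
TLB_SETS = 64 // 2       # 64 entries, 2-way set associative -> 32 sets


class SoftwareLoadedTLB:
    def __init__(self, page_table: dict[int, int]):
        self.page_table = page_table                 # virtual page -> physical page
        self.sets = [[] for _ in range(TLB_SETS)]    # each set holds up to 2 (vpn, ppn) pairs
        self.misses = 0

    def translate(self, vaddr: int) -> int:
        vpn, offset = divmod(vaddr, PAGE_BYTES)
        entries = self.sets[vpn % TLB_SETS]
        for tag, ppn in entries:
            if tag == vpn:                           # TLB hit
                return ppn * PAGE_BYTES + offset
        # TLB miss: stands in for the exception to the software handler.
        self.misses += 1
        ppn = self.page_table[vpn]                   # the "table walk"
        if len(entries) == 2:
            entries.pop(0)                           # assumption: evict the older entry
        entries.append((vpn, ppn))
        return ppn * PAGE_BYTES + offset
```

Three pages that map to the same set are enough to force a replacement in this model, which is a convenient way to exercise the reload path.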

External Bus Interface

The 603’s external bus is compatible 
with the 601 [1], which was derived 
from Motorola’s 88110 multiproces- 
sor bus [8]. The 603’s bus interface 
unit (BIU) receives requests for bus 
operations from the instruction and 
data caches, and executes the opera- 
tions per the 603 bus protocol. Mem- 
ory accesses can occur in single-beat 
(1 to 8 bytes) and four-beat burst (32 
bytes) data transfers when the bus is 
configured as 64 bits, and in single-beat
(1 to 4 bytes), two-beat (8 bytes),
and eight-beat (32 bytes) data transfers
when the bus is configured as 32
bits.

The BIU provides address queues,
and prioritization and bus control
logic. It consists of independently
arbitrated 32-bit address and 64-bit
data buses (the data bus can optionally
be configured as 32 bits). The
603’s bus transaction consists of sepa- 
rate address and data tenures. This 
allows a variety of bus arbitration 
schemes to be supported. Specifically, 
the 603 supports address pipelining, 
where the address tenure of a new 
transaction is allowed to begin before
the data tenure of the current transaction
has completed. The 603’s bus
interface also supports split transactions,
where the address and data
tenures can be arbitrated for and controlled
by different masters; and enveloped
transactions, where the address
and data tenures of a new
transaction can occur after the address
tenure of a previous transaction
has ended, but before the data tenure 
for the previous transaction has 
begun. Enveloped transactions can 
be used for deadlock prevention in 
hierarchical bus environments, and 
to speed snoop push operations. 

The 603 bus protocol also supports 
an address retry capability to support 
an efficient snooping protocol for 
memory coherency across the system. 
In a multiple processor system, the 
address retry can be used by a snoop- 
ing master to interrupt another mas- 
ter’s transaction on the bus. This be- 
comes necessary when a bus master 
begins a transaction to access data 
that has been locally modified in the
603’s cache. The 603 (referred to as
the snooping 603) detects the access
to this modified memory region, and 
uses the retry capability to force the 
first master to abort its transaction 
and retry it later. This enables the
snooping 603 to write back the modified
data to memory for use later by
the bus master that has been retried. 
The 603 bus interface also contains a 
clock multiplier that enables the pro- 
cessor to run at twice, three times, or 
four times the external bus clock 
speed. 
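
The address retry behavior described above can be sketched with the three MEI states of the data cache. This is an illustrative model only: the response strings are hypothetical, and invalidating the line after the push or after an exclusive snoop hit is an assumption rather than something the article states.

```python
from enum import Enum


class LineState(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    INVALID = "I"


def snoop(cache: dict[int, LineState], line_addr: int) -> str:
    """Return the snooping processor's response to another master's access."""
    state = cache.get(line_addr, LineState.INVALID)
    if state is LineState.MODIFIED:
        # Force the other master to retry while the modified data is
        # written back to memory.
        cache[line_addr] = LineState.INVALID   # assumption: line invalidated after the push
        return "retry"
    if state is LineState.EXCLUSIVE:
        cache[line_addr] = LineState.INVALID   # assumption: MEI has no shared state to fall back to
    return "ok"
```

When the retried master repeats its transaction, the line is no longer modified in the snooping cache, so the second attempt proceeds.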

Debug Features 

The 603 incorporates a JTAG/IEEE 
1149.1 boundary scan interface to 
facilitate board-level testing. The 603 
also incorporates a special interface 
(accessible through the JTAG port) 
that allows an external service proces- 
sor to read or write memory or any of 
the 603’s internal registers. A special 
mode allows pipeline status informa- 
tion to be displayed for tracking the 
instruction stream in real time. Addi- 
tionally, a programmable instruction 
address breakpoint is provided to as- 
sist in software debugging. 

Conclusion 

The combined efforts of Apple, IBM, 
and Motorola have been focused on
creating PowerPC, a new RISC archi- 
tecture that will form the basis of a 
whole new generation of high-perfor- 
mance, low-cost computers. With the 
flexibility built into the PowerPC Ar- 
chitecture, we expect to deliver a 
wide range of microprocessors all the 
way from low-cost embedded control- 
lers through massively parallel super- 
computers. The 603 is the first in a 
series of PowerPC microprocessors 
targeted at high-volume, low-cost, 
portable and desktop personal com- 
puters. It provides performance 
heretofore available only in signifi- 
cantly higher-priced, higher-cost, 
and higher-powered processors.

References 

1. Allen, M. and Becker, M. Multiprocessing
aspects of the PowerPC 601
Microprocessor. In Proceedings of
COMPCON (Feb. 1993).

2. Becker, M.C., Allen, M.S., Moore,
C.R., Muhich, J.S., and Tuttle, D.P.
The PowerPC 601 Microprocessor.
IEEE Micro 13, 5 (Oct. 1993), 54-68.

3. Chang, A., Mergen, M.F., Rader,
R.K., Roberts, J.A., and Porter, S.L.
Evolution of storage facilities in AIX
Version 3 for RISC System/6000 processors.
IBM J. Res. Develop. 34 (Jan.
1990).

4. Cohen, D. On holy wars and a plea for
peace. IEEE Comput. 14, 10 (Oct.
1981).

5. Dubois, M., Scheurich, C., and Briggs,
F. Synchronization, coherence, and
event ordering in multiprocessors.
IEEE Comput. 21, 2 (Feb. 1988), 9-21.

6. ANSI/IEEE Std 754-1985. Standard for binary
floating-point arithmetic. IEEE, Piscataway,
N.J., 1985.

7. Groves, R.D. and Oehler, R.R. RISC
System/6000 processor architecture.
Microprocessor & MicroSystems 14, 6
(July/Aug. 1990).

8. Gullette, B. The Design of the 88110
Bus Interface. In Proceedings of RISC
’92 (Feb. 1992).

9. Lamport, L. How to make a multiprocessor
computer that correctly executes
multiprocess programs. IEEE
Trans. Comput. C-28, 9 (Sept. 1979),
241-248.

10. Oehler, R.R. and Groves, R.D. IBM
RISC System/6000 processor architecture.
IBM J. Res. Develop. 34, 1 (Jan.
1990), 23-36.

11. Patterson, D.A. and Ditzel, D.R. The case
for the reduced instruction set computer.
Computer Architecture News (Oct.
15, 1980).

12. Patterson, D.A. and Sequin, C.H.
RISC-1: A Reduced Instruction Set
VLSI Computer. In Proceedings of the
Eighth Annual International Symposium
on Computer Architecture (May 1981),
pp. 443-457.

13. The PowerPC Architecture. Morgan-Kaufmann,
San Mateo, Calif., Dec.
1993.

14. Radin, G. The 801 Minicomputer. In
Proceedings of the Symposium on Architectural
Support for Programming Languages
(March 1982), pp. 39-47.

15. Simpson, R.O. and Hester, P.D. The
IBM RT PC ROMP processor and
memory management architecture.
IBM Syst. J. 26 (1987), 346-360.

16. Stone, J.M. and Fitzgerald, R.P. An
overview of storage in PowerPC. IBM
Res. Rep. RC 19133, Aug. 1993.

About the Authors: 

BRAD BURGESS is a member of the tech- 
nical staff at Motorola involved in both 
instruction set architecture and microar- 
chitecture implementation. Current re- 
search interests include microarchitecture 
implementation and optimization, perfor- 
mance analysis, and compiler optimiza- 
tion. email: bradb@ibmoto.com 

NASR ULLAH is a RISC applications de- 
signer in Motorola’s RISC microprocessor 
division who has been working on 
PowerPC microprocessor board designs. 
Current research interests include parallel 
processing, computer design, computer 
architecture, performance measurement, 
evaluation and prediction. email:
nasr_ullah@riscgate.sps.mot.com

PETER VAN OVEREN is a RISC applica- 
tions designer at Motorola who has been 
developing user manuals and associated 
documentation for the PowerPC microprocessor
family. Current research interests
include electronic information distribution,
computer architecture, performance
evaluation, and systems design.
email: pvo@daffy.sps.mot.com


Authors’ Present Address: Motorola,
6501 William Cannon Drive West, Austin,
TX 78735-8598.


DEENE OGDEN is a senior engineer at 
IBM’s Somerset Design Center who was 
the 603 control logic team leader. Current 
research interests include microprocessor 
and computer design. Author’s Present
Address: International Business Machines
Corporation, 11400 Burnet Road,
Austin, TX 78758; email: deene@austin.ibm.com


In this article, the terms “PowerPC 603 Micro- 
processor” and “603” are used to denote the 
second microprocessor from the PowerPC Ar- 
chitecture family. PowerPC, PowerPC Architec- 
ture, PowerPC 601, PowerPC 603, and POWER 
are trademarks of the International Business 
Machines Corporation, and are used under li- 
cense from International Business Machines 
Corporation. IBM is a registered trademark of 
International Business Machines Corporation. 
Motorola is a trademark of Motorola, Incorpo- 
rated. Apple is a trademark of Apple Computer 
Corporation. SPEC, SPECint92 and SPECfp92 
are trademarks of Standard Performance Evalu- 
ation Corporation. 


Permission to copy without fee all or part of this 
material is granted provided that the copies are 
not made or distributed for direct commercial 
advantage, the ACM copyright notice and the 
title of the publication and its date appear, and 
notice is given that copying is by permission of 
the Association for Computing Machinery. To 
copy otherwise, or to republish, requires a fee 
and/or specific permission. 


© ACM 0002-0782/94/0600 $3.50 







Brad W. Suessmith and George Paap III 






PowerPC 603 Microprocessor

Power Management

Addressing the need for long battery life in portable applications and environmental
concerns about energy consumption requires microprocessors with low power consumption
as well as high performance. A primary design goal of the PowerPC 603 microprocessor
was to provide sophisticated power management without compromising next-generation
performance. As a result, the 603 is ideal for portable applications such as laptop
computers in addition to Energy Star compliant desktop computers. The PowerPC 603
allows the system designer to control energy consumption through both hardware and
software means as well as providing automatic internal power management.


PowerPC 603 Features

The 603 contains numerous power
management (PM) features:

• 0.5 µm four-layer metal CMOS process

• 3.3V power supply

• 5V or 3.3V compatible inputs

• Functional units can be dynamically
powered up and down

• Four power states

• Fully static design

• Reduced-pinout mode

• 64-bit or 32-bit data bus

Dynamic Power Management

A sophisticated feature of the
PowerPC 603 is dynamic power management
(DPM), which automatically
powers up and down the individual
execution units on the chip, based on
the contents of the instruction
stream. For example, if no floating-point
instructions are being executed,
the floating-point unit is automati- 
cally powered down. Power is not ac- 
tually removed from the execution 
unit; instead, each execution unit has 
an independent clock input, which is 
automatically controlled on a clock- 
by-clock basis. Since CMOS circuits 
consume negligible power when they 
are not switching, stopping the clock 
to an execution unit effectively elimi- 
nates its power consumption. Setting 
a bit in an internal hardware register 
enables DPM. 
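
The clock-by-clock gating described above can be modeled in a few lines. This is a deliberately simplified sketch: it assumes each instruction occupies its execution unit only in the cycle in which it is listed, and the unit names and function names are illustrative rather than taken from the 603 documentation.

```python
UNITS = ("branch", "integer", "floating-point", "load/store", "system")


def clocked_units(instructions_per_cycle: list[list[str]]) -> list[set[str]]:
    """For each cycle, return the set of execution units that receive a clock."""
    return [set(cycle) & set(UNITS) for cycle in instructions_per_cycle]


def gated_fraction(instructions_per_cycle: list[list[str]], unit: str) -> float:
    """Fraction of cycles in which `unit` has its clock stopped."""
    active = clocked_units(instructions_per_cycle)
    idle = sum(1 for units in active if unit not in units)
    return idle / len(active)
```

In a four-cycle stream where only one cycle contains a floating-point instruction, the floating-point unit's clock is stopped for three quarters of the time, which is where the savings of DPM come from in integer-dominated code.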

Static Power States 

The most distinctive feature of the 603 is
the inclusion of four power states:
Full On, Doze, Nap, and Sleep. Software
selects the Doze, Nap, or Sleep mode by setting
one (and only one) of the three
power-saving mode bits. The 603
provides a separate interrupt and in- 
terrupt vector for power manage- 
ment: the System Management Inter- 
rupt (SMI). The 603 also contains a 
decrement timer that allows it to 
enter the Nap or Doze mode for a 
predetermined amount of time and 
then return to Full On operation 
through the Decrementer interrupt 
(DI). Note that the 603 cannot switch 
from one power management mode 
to another without first returning to 
Full On mode. The Nap and Sleep 
modes disable bus snooping; there- 
fore a hardware handshake is pro- 
vided to ensure data cache coherency 
before the 603 enters these power 
management (PM) modes. Table 1 
summarizes the four power states, 
which are further detailed in the fol- 
lowing paragraphs. 




Table 1. PowerPC 603 power management states

PM mode      Functioning units      Activation method        Full On Wake Up method
-----------  ---------------------  -----------------------  --------------------------------
Full On      Requested logic by
(with DPM)   demand

Doze         Bus snooping           Controlled by software   External asynchronous interrupts
             Data cache as needed                            Decrementer interrupts
             Decrementer timer                               Reset

Nap          Decrementer timer      Controlled by hardware   External asynchronous interrupts
                                    and software             Decrementer interrupts
                                                             Reset

Sleep        None                   Controlled by hardware   External synchronous interrupts
                                    and software             Reset


Full On without DPM. Full On 
mode without DPM provides: 

• Default state 

• All functional units are operating at 
full processor speed at all times 

Full On with DPM enabled. Full On 
mode with DPM provides on-chip 
power management without affecting 
the functionality or performance of 
the PowerPC 603. 

• Required functional units are oper- 
ating at full processor speed 

• Functional units are clocked only 
when needed 

• No software or hardware interven- 
tion required after mode is set 

• Software/hardware and perfor- 
mance transparent 

Doze. Doze mode disables most 
functional units but maintains data 
cache coherency by enabling the bus 
interface unit and snooping. A snoop 
hit will cause the PowerPC 603 to 
enable the data cache, copy the data 
back to memory, and then disable the 
cache. 

• Most functional units disabled 

• Bus snooping and time base/
decrementer still enabled

• Several methods of returning to 
Full On mode 

— assert INT, SMI, or DI inter- 
rupts 

— hard or soft reset 

• Transition to Full On state takes no 
more than a few processor cycles 

• PLL running and locked to 
SYSCLK 


Nap. The Nap mode disables the 
PowerPC 603 but still maintains the 
Phase Locked Loop (PLL) and the 
decrement timer. The timer can be 
used to wake up after a programmed 
amount of time. Because bus snooping
is disabled for Nap and Sleep
modes, a hardware handshake using
the Quiesce request and acknowledge
signals is required to maintain data
cache coherency (see Sleep mode). 

• Time base/decrementer still en- 
abled 

• Most functional units disabled (in- 
cluding bus snooping) 

• All nonessential input receivers dis- 
abled 

• Several methods of returning to 
Full On mode 

— assert INT, SMI, or DI inter- 
rupts 

— hard or soft reset 

• Transition to Full On takes no 
more than a few processor cycles 

• PLL running and locked to 
SYSCLK 

Sleep. Sleep mode consumes the 
least amount of power of the four 
modes, since all functional units are 
disabled. To conserve the maximum 
amount of power, the PLL may be 
disabled and the SYSCLK may be 
removed. Before the PowerPC 603 
enters the Nap or Sleep mode, the 
603 will assert the QREQ signal to 
indicate that it is ready to disable bus 
snooping. When the system has ensured
that snooping is no longer necessary,
it will assert QACK and the
PowerPC 603 will enter the Sleep or
Nap mode.

• All functional units disabled (in- 
cluding bus snooping and time base) 

• All nonessential input receivers
disabled

— Internal clock regenerators dis- 
abled 

— PLL still running (see following) 

• Sleep mode sequence 
— set Sleep bit 

— 603 asserts quiesce request 
(QREQ_) 

— system asserts quiesce acknowl- 
edge (QACK_) 

— 603 enters sleep mode after sev- 
eral processor clocks 

• Several methods of returning to 
Full On mode 

— assert INT or SMI interrupts 
— hard or soft reset 

• PLL may instead be disabled and 
SYSCLK may be removed while in 
Sleep mode 

• Return to Full On mode after PLL 
and SYSCLK disabled in Sleep mode 

—enable SYSCLK 
— reconfigure PLL into desired 
processor clock mode 

— system logic waits for PLL start-up
and relock time (100 µsec)

— system logic asserts one of the 
sleep recovery signals (e.g., INT or 
SMI) 

Power Management Mode 
Transitions 

The PM mode transitions are typi- 
cally controlled by the operating sys- 
tem (OS) or by the chipset hardware. 



For example, as part of the OS
loading/start-up process, one of the
power-saving modes is selected by
setting the appropriate power management
mode bits. The proper
power management mode is selected
on the basis of whether any other
devices will be performing DMA
transfers that need to be snooped by
the 603. If no snooping action is required,
then Nap mode is chosen,
otherwise Doze mode is used. If the
time base is not needed either, or if its
proper setting can be determined
from an external source (such as a
real-time clock) then the Sleep mode
can be selected, offering the lowest
power consumption. After setting the
proper power management bit and
clearing whatever start-up processes
are deemed necessary, the OS then
creates a new low(est) priority process
(an “idle” process). Now, when all the
other processes are idle, this low priority
process (or thread) is executed,
placing the 603 in the selected power-down
mode. Whenever the OS time
counter (either the internal Time
Base/Decrementer or an external
countdown timer) expires and interrupts
the processor, the 603 switches
to Full On mode and processes the
interrupt service routine. The interrupt
service routine calls the OS
scheduler to update priorities. If no
other processes are active yet, the 603
then returns to the “idle” process and
enters the low-power mode.

Another possible method is to rely
on the PC chipset to monitor the bus
activity of the 603 (similar to how
most x86-based PCs currently operate).
When activity timers in the chipset
detect no activity by the system’s
components for a predetermined
length of time, the chipset generates an
SMI interrupt. In the interrupt handler
for this routine, the 603 checks
the interrupt status register in the
chipset, indicating whether the interrupt
was caused by a CPU or a peripheral
activity timer expiring. If it
was a CPU activity timer expiring, the
603 calls a routine that checks the status
of the scheduler to decide
whether or not the 603 was truly inactive
(the 603 could have been simply
executing a tight loop whose code
and data spaces fit inside the on-chip
caches). If the 603 was indeed inactive,
the routine would then place the
603 in one of the PM modes, after
checking the power status of the
other peripherals in the system. It
should be placed into the Doze mode
if any other peripherals with DMA
capability are still powered up. If no
other peripherals with DMA capability
are powered up, or it is known
that no DMA activity will take place,
then the 603 can be placed into either
Nap or Sleep mode.

Table 2. Typical power consumption

                          Frequency (internal)
PM mode                   50 MHz    66 MHz    80 MHz
------------------------  --------  --------  --------
Full On no DPM            1.78 W    2.18 W    2.54 W
Full On with DPM          1.52 W    1.89 W    2.20 W
Doze                      242 mW    307 mW    366 mW
Nap                       89 mW     113 mW    135 mW
Sleep                     66 mW     89 mW     105 mW
Sleep, no PLL             15 mW     18 mW     19 mW
Sleep, no PLL or SYSCLK   <5 mW     <5 mW     <5 mW

Table 3. 603 Performance

             Frequency (Int/Ext) in MHz
Benchmark    50/50    60/66    80/40
-----------  -------  -------  -------
SPECint92    47       60       75
SPECfp92     53       70       85

Figure 1. PowerPC 603 power management mode transitions
(states: Full On, Doze, Nap, Sleep, Deep Sleep; transitions labeled
Set Doze Bit, Set Nap Bit/QREQ, Set Sleep Bit/QREQ, System Ack/QACK,
PLL Off/SYSCLK Off, SYSCLK On/PLL On, and SMI, EI, DI)

The power management mode 
transitions of the PowerPC 603 are 
shown graphically in Figure 1. Depending
on the power management
scheme used, a decrementer interrupt,
SMI, or external interrupt (EI)
invokes the interrupt handler rou- 
tine. To enter the Doze mode, the 
doze bit is set. The PowerPC 603 will 
remain in Doze mode until an inter- 
rupt returns it to Full On mode. The 
PowerPC 603 then invokes the inter- 
rupt handler to determine its next 
action. 

Transitioning to Nap mode re- 
quires a hardware handshake, since 
the PowerPC 603 will not enforce 
data cache coherency in this mode. 
Once the nap bit has been set and the 
PowerPC 603 is ready to Nap, it indicates
this by asserting QREQ. When
the system has ensured that snooping
is no longer required, it asserts 
QACK. This allows the system to con- 
trol when the PowerPC 603 actually 
enters Nap mode. 

Sleep mode also requires a QACK 
from the system before it is entered. 
Since the decrementer timer is not 
functioning in Sleep mode, only an
SMI or EI will wake the part up. To
conserve the maximum amount of 
power, it is also possible to put the 
PowerPC 603 into Deep Sleep by first 
turning off the PLL and then the 
SYSCLK. Before the PowerPC 603 
can recognize the SMI or EI, the
SYSCLK must be reapplied and then 
the PLL must be turned on. 
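
The mode transitions described in this section can be summarized as a small state machine. The sketch below is a simplified model, not the 603's actual control logic: the class and method names are hypothetical, Deep Sleep (PLL and SYSCLK off) is left out, and the wake-up sets are taken from the bullet lists for each mode.

```python
class PowerStateMachine:
    # Events that return each mode to Full On.
    WAKE_EVENTS = {
        "Doze": {"SMI", "EI", "DI", "reset"},
        "Nap": {"SMI", "EI", "DI", "reset"},
        "Sleep": {"SMI", "EI", "reset"},   # decrementer is stopped in Sleep
    }

    def __init__(self):
        self.state = "Full On"
        self.pending = None    # mode waiting for QACK from the system
        self.qreq = False

    def set_mode_bit(self, mode: str) -> None:
        if self.state != "Full On":
            raise RuntimeError("modes can only be entered from Full On")
        if mode == "Doze":
            self.state = "Doze"                     # no handshake: snooping stays enabled
        elif mode in ("Nap", "Sleep"):
            self.pending, self.qreq = mode, True    # assert QREQ, wait for QACK
        else:
            raise ValueError(mode)

    def qack(self) -> None:
        """System acknowledges that snooping is no longer required."""
        if self.pending is not None:
            self.state, self.pending, self.qreq = self.pending, None, False

    def event(self, name: str) -> None:
        if name in self.WAKE_EVENTS.get(self.state, ()):
            self.state = "Full On"
```

The model captures two properties stated in the text: Nap and Sleep are entered only after the system answers QREQ with QACK, and a change from one power-saving mode to another always passes through Full On.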

32-bit data bus mode and reduced-pinout
mode. The 603 supports an
optional 32-bit data bus mode which 
is configured at reset. This mode 
breaks a 64-bit data access into two 
32-bit bus transactions. In addition, 
the 603 offers a reduced pinout mode 
to lower power consumption (by 
idling the switching of numerous 
pins) while the 603 is running in 32- 
bit data bus mode. 


PowerPC 603 power consumption. 
How all these power management 
features benefit the portable system 
designer is perhaps illustrated more 
clearly by the (preliminary) typical 
power consumption figures given in 
Table 2. 
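
One way to read Table 2 is to compute a time-weighted average for a mix of modes. The sketch below uses the 66 MHz column; the time shares are hypothetical inputs chosen by the caller, and the model ignores the energy and time spent in mode transitions.

```python
# Typical power in watts at 66 MHz internal frequency, from Table 2.
POWER_66MHZ_W = {
    "Full On with DPM": 1.89,
    "Doze": 0.307,
    "Nap": 0.113,
    "Sleep": 0.089,
}


def average_power(time_share: dict[str, float]) -> float:
    """Time-weighted average power for a mix of modes whose shares sum to 1."""
    if abs(sum(time_share.values()) - 1.0) > 1e-9:
        raise ValueError("time shares must sum to 1")
    return sum(POWER_66MHZ_W[mode] * share for mode, share in time_share.items())
```

For example, a system that is Full On with DPM 10% of the time and in Nap mode for the remaining 90% averages about 0.29 W under these assumptions, compared with 1.89 W when it never leaves Full On.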

PowerPC 603 performance estimates.
Table 3 lists the simulated
benchmark results for the 603, as- 
suming a 1MB second-level cache. 

Summary 

The power management features of 
the PowerPC 603 enable portable sys- 
tems designers to provide enhanced 
performance for portable computers 
while conserving energy.


References 

1. Motorola Inc. PowerPC™ 603 RISC
Microprocessor. Tech. Sum. — Rev. 1, 
1993. 


About the Authors: 

BRAD SUESSMITH is an applications 
engineer with Motorola’s RISC micropro- 
cessor division. Current research interests 
include high-performance microprocessor 
architecture, advanced personal com- 
puter architecture, and low-power/port- 
able system design. 

GEORGE PAAP III is an applications 
engineer with Motorola’s RISC micropro- 
cessor division. Current research interests 
include microprocessor architecture, 
PowerPC assembly programming, multi- 
media authoring, rendering. 


Authors’ Present Address: Motorola, 

Inc., 6501 William Cannon Drive West, 
Austin TX 78735-8598; email: 
{brad_suessmith, george_paap}@email. 
sps.mot.com 


In this article, the terms “PowerPC 603 Micro- 
processor” and “603” are used to denote the 
second microprocessor from the PowerPC Ar- 
chitecture family. 









Ali Poursepanj 




The PowerPC Performance 
Modeling Methodology 

The PowerPC performance modeling was based on trace-driven simulation, where the
microprocessor organization is specified as a model, benchmark traces are generated and
applied to the model, and performance data is measured and analyzed. In spite of the
advantages of trace-driven simulation, which will be described later, meaningful
benchmark traces are large and require prohibitive amounts of storage and time to analyze.
Trace-sampling techniques were used to reduce trace storage space and
simulation execution time. Trace-sampling techniques have previously been used for
evaluation of cache memory systems [3, 4] and superscalar microprocessor systems
[5, 6]. Much smaller traces can be generated using trace-sampling techniques (<1% of
the actual size) [4].


There are a number of advantages
to a trace-driven approach. There is
no need to generate program results
in a trace-driven model. Designers
can concentrate on performance
implications of architectural features
without having to worry about
generating correct results or modeling
system I/O and other overhead. Traces
can be statistically sampled to reduce
simulation time with a very acceptable
reduction in accuracy. This
allows relatively quick analysis of large
benchmarks, such as the SPEC 92
suite.

The usefulness of the PowerPC
performance models does not end at
the processor design phase. Compiler
developers need models to help tune
compilers to the PowerPC instruction
set. System designers need to make
trade-offs on system cache and
memory designs. System software groups
need to use processor models to
measure and tune the performance of
libraries and low-level operating
system services. Prospective customers
who want to run their specific
applications on the performance model to
determine whether the part meets their
needs are also viable users of the model.
The PowerPC performance model is
also used by the design team to assist
in detecting performance bugs (e.g.,
unexpected bubbles) in the final logic
model.

The following sections will discuss 
performance modeling and the trace 


generation and simulation methodol- 
ogy used for the PowerPC micropro- 
cessors. 

Trace-driven Performance 
Modeling 

Performance Models 

Processor performance models can 
be roughly divided into two types: 
trace-driven models and execution- 
based models. Trace-driven models 
simulate the data flow of the proces- 
sor, but do not actually execute in- 
structions or generate results. In- 
stead, they are fed a program trace, 
called a dynamic trace, which con- 
tains the dynamic sequence of in- 
struction addresses, instruction op- 
codes, and data addresses that 


COMMUNICATIONS OF THE ACM June 1994/Vol.37, No.6 47





occurred during execution. Execu- 
tion-based models, on the other 
hand, execute code much as a real 
processor would, generating results 
and using those results in conditional 
branches and so forth. The benefits 
and drawbacks of each simulation 
model are described in the following 
paragraphs. 

Although the trace-driven model 
offers advantages listed in the previ- 
ous section, meaningful traces can 
require a large amount of storage and 
must be recorded from an existing 
system that has the same instruction 
set. Correctness of the traces or 
model behavior may be difficult to 
verify, since a trace-driven model 
does not generate program results. 
Data-dependent timing situations 
cannot be modeled in a trace-driven 
model. 

In an execution-based model, pro- 
gram executables take up much less 
space than dynamic traces and do not 
require an existing system for trace 
generation. Correctness of the model 
behavior is easier to verify, since the 
model must generate the correct pro- 
gram results. Therefore, microarchi- 
tectural logic problems can be discov- 
ered and corrected at an earlier stage 
in the design cycle. Execution-based 
models provide more accuracy than 
trace-driven models and can execute 
faster than trace-driven models. A 
drawback to this method is that large 
benchmarks take a long time to run, 
since they require the full processor 
to be modeled, instead of just the data 
flow model. The longer run time
could seriously detract from
productivity in making
performance-oriented design trade-offs. Table 1
compares some of the features of the
execution-based and trace-driven
models.
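The distinction drawn above can be made concrete with a minimal sketch of a trace-driven timing model. Everything here is an illustrative assumption rather than the BRAT design: the `TraceRecord` fields simply mirror the three items a dynamic trace is said to contain (instruction address, opcode, data address), and the timing rule (one cycle per instruction plus a fixed penalty on the first touch of each cache line) is deliberately simplistic. Note that the model never computes a program result; it only accounts for time.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TraceRecord:
    """One entry of a dynamic trace: what executed, not what it computed."""
    inst_addr: int
    opcode: int
    data_addr: Optional[int] = None  # None when the instruction touches no memory


def simulate(trace: Iterable[TraceRecord],
             line_size: int = 32,
             miss_penalty: int = 10) -> float:
    """Return mean IPC for a hypothetical one-instruction-per-cycle machine.

    The data cache is unbounded, so only the first access to each line
    misses. Both parameters are made-up values for illustration.
    """
    seen_lines = set()
    cycles = 0
    completed = 0
    for rec in trace:
        cycles += 1  # base cost of every instruction
        if rec.data_addr is not None:
            line = rec.data_addr // line_size
            if line not in seen_lines:
                seen_lines.add(line)
                cycles += miss_penalty  # cold miss on this line
        completed += 1
    return completed / cycles if cycles else 0.0


# A four-instruction trace: three memory references to distinct lines.
trace = [
    TraceRecord(0x200, 0x82420000, 0x210C1C10),
    TraceRecord(0x204, 0x80F20008, 0x210A5208),
    TraceRecord(0x208, 0x39000000),
    TraceRecord(0x20C, 0x91070000, 0x210C80A8),
]
ipc = simulate(trace)  # 4 instructions in 4 + 3 * 10 = 34 cycles
```

Because the model consumes only addresses and opcodes, data-dependent timing (for example, a latency that depends on operand values) has nothing to key on, which is the limitation the text attributes to trace-driven models.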

Trace Generation 

Instruction traces, the records of the 
sequence of memory addresses refer- 
enced by a processor during execu- 
tion, are widely used by designers 


and researchers to analyze the per- 
formance of computer systems. Trace 
generation refers to the process of 
recording the executed instructions. 
To generate an instruction trace, pro- 
grams are compiled and executed, 
and traces are recorded during the 
program execution in a variety of 
ways. Traces can be generated 
through hardware-based monitors, 
interrupt-based methods, simulation- 
based methods, microcode-based 
methods, or instrumented program- 
based methods [9]. 

Hardware-captured traces. The
processor’s references to memory are
recorded directly by hardware
performance-monitor logic. The captured
references are typically physical
memory references, which are desir- 
able for the cache memory simulation 
but are not as useful as virtual ad- 
dresses for finite cache processor per- 
formance issues. Complexity, cost, 
and lack of flexibility are the primary 
limitations of this approach. 

Interrupt-based traces. In inter- 
rupt-based techniques, the program 
is interrupted after execution of each 
instruction, and instruction address 
information is recorded. The prob- 
lem with this approach is that the 
need for interrupting every instruc- 
tion slows down the trace generation. 
Another problem is that tracing oper- 
ating system code can be difficult. 

Simulation-based traces. The simu- 
lation-based method is based on the 
use of a cycle-level architectural simu- 
lator that executes the simulated ma- 
chine at the binary level. By imple- 
menting the ‘programmer model’ of 
the machine exactly, real workloads 
(both operating system and applica- 
tions) can be traced. By implement- 
ing a cycle-level model, not only CPU 
memory accesses, but phenomena 
such as system bus, I/O bus, processor 
and I/O cache and buffer activities 
can be traced in detail. However, the 
speed of trace collection in this 
method is very slow. 

Microcode-based traces. The mi- 


crocode-based method enables cap- 
turing of long traces for multipro- 
grammed user and operating system 
activities. The advantage of this 
method is that it is fast and allows 
trace generation of existing and future
processors. The problem with
this approach is that many RISC pro- 
cessors do not use microcode and 
CISC chips have their microcode in 
their read-only memory (ROM), 
which is not modifiable. 

Instrumented program-based traces. 
In instrumented program-based trace
generation, the program source code
(operating system or application) is 
directly modified for trace genera- 
tion. This method has attracted con- 
siderable attention and can speed up 
the trace-generation process. 

An IBM internal interrupt-based 
trace-generation tool was used for 
trace generation of the SPEC 92 and 
Transaction Processing Council 
(TPC) benchmarks. SPEC 92 and 
TPC source code were compiled and 
executed on RISC System/6000 serv- 
ers in the Somerset Design Center, 
and traces were recorded. The TPC 
traces were full continuous traces, 
while the SPEC 92 traces were sampled
traces. Sampling techniques
were used to reduce the simulation
turnaround time during the performance
analysis of the PowerPC
microprocessors. Trace sampling
turned out to be a good technique for
increasing the productivity of the
performance modeling.

Trace Sampling for Reducing 
the Cost of Simulation 

Trace sampling has been used in the 
last two decades to reduce the cost of 
storage and execution time of the 
trace-driven simulation. Much 
smaller instruction traces can be gen- 
erated to represent the full continu- 
ous trace with much lower costs [6]. 
The “representative” accuracy of the 
sampled trace in terms of how well it 
represents the actual workload is an 
obvious concern. Experimental results
[6] show that much smaller sampled
traces can work as well as the
continuous full traces for performance
estimation while resulting in a
significant cost reduction.

Table 2 compares sampled traces
and continuous traces for the SPEC
92 integer benchmarks. The six
SPEC 92 integer benchmarks were
simulated using full and sampled
traces and the simulation results were
compared. As shown in this table, the
difference between the geometric
mean instructions per cycle (IPC),
used to calculate the SPEC ratios, of
the sampled traces and the mean IPC of
the continuous traces is less than 2%
(1.000 vs. 0.9807) for a simulation
time cost reduction of more than
800X (39,762 vs. 48 minutes).
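The two headline figures can be rechecked from the Table 2 data. The sketch below is only arithmetic on the published numbers (IPC values normalized so that each continuous trace is 1.0000, simulation times in minutes); the variable and function names are ours.

```python
import math

# Sampled-trace IPC, normalized to the continuous-trace IPC (Table 2).
sampled_ipc = {
    "compress": 0.9801,
    "eqntott": 1.0373,
    "espresso": 1.0645,
    "gcc": 0.9539,
    "li": 0.9944,
    "sc": 0.8669,
}
# Simulation times in minutes (Table 2).
continuous_minutes = [672, 8898, 9516, 960, 16668, 3048]
sampled_minutes = [8] * 6


def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


gm = geometric_mean(sampled_ipc.values())                 # about 0.981
speedup = sum(continuous_minutes) / sum(sampled_minutes)  # 39762 / 48
worst = max(sampled_ipc, key=lambda name: abs(sampled_ipc[name] - 1.0))
worst_error = abs(sampled_ipc[worst] - 1.0)               # about 0.133 for sc
```

The geometric mean hides the spread: espresso overestimates by about 6% and sc underestimates by about 13%, and the two partly cancel in the mean.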

Notice that although the overall 
IPC geometric mean is within 2%, the 
error margin for individual traces can 
go up to 13% (for sc). The reason for 
this difference is explained in more 
detail in [6]. It was noticed that sc
has a high percentage of conditional
branches and high cache-miss ratios
compared to other integer
applications. Apparently, one-million-instruction
uniform sampling did not
work for sc in terms of IPC the way it
worked for the other SPEC 92 integer
applications. Figure 1 shows the
sampled and continuous trace IPC for the
SPEC 92 li benchmark.

Ideally, sampled traces and continuous
traces should yield the same
IPC when both are simulated
on the same processor model
under the same conditions. This may
not be the only condition for qualify- 
ing a sampled trace to be representa- 
tive of a full trace. Simulation results 
in [6] show that not only mean IPC 
but a broad set of other performance 
data (such as instruction frequency 
mix, nature of the branches in the 
trace, percent of instruction stalls in
the pipeline) can closely and acceptably
correlate between sampled and


Table 1. Comparison of execution-based and trace-driven models

Metric                                                   Trace-driven model   Execution-based model
Accuracy                                                 Moderate             High
Execution speed                                          Low                  High
Usability for design verification and code optimization  Low                  High
Fast performance estimation capability of the
  standard benchmarks using sampled traces               High                 Low
Capability for study of the future designs               High                 Low
Usability for cache and CPU design trade-offs            High                 Low


Table 2. Comparison of sampled and continuous traces of SPEC 92 integer applications

SPEC 92     Continuous  Sampled  Continuous     Sampled    Continuous trace  Sampled trace
integer     trace       trace    trace          trace      simulation        simulation
benchmark   IPC         IPC      length         length     time (minutes)    time (minutes)

compress    1.0000      0.9801      80,537,933  1,000,000     672            8
eqntott     1.0000      1.0373   1,067,642,760  1,000,000   8,898            8
espresso    1.0000      1.0645   1,141,805,665  1,000,000   9,516            8
gcc         1.0000      0.9539     114,605,018  1,000,000     960            8
li          1.0000      0.9944   2,000,000,000  1,000,000  16,668            8
sc          1.0000      0.8669     365,641,328  1,000,000   3,048            8
CINT92      1.0000      0.9807                             39,762           48

Compress is a text compression and decompression utility that
reduces the size of the named files using adaptive Lempel-Ziv
coding. Eqntott translates a logical representation of a Boolean
equation to a truth table. The primary computation is a sort
operation. Espresso is a tool for the generation and optimization of
programmable logic arrays (PLAs). It characterizes work in the
EDA market and logic simulation and routing algorithm areas.
Gcc is a popular compiler that compiles 76 input files. This
benchmark represents one phase (compile) of a software development
application. Li is a Lisp interpreter that solves the nine queens
problem. The backtracking algorithm is recursive and represents
object-oriented programming type problems. Sc is a spreadsheet
benchmark that calculates budgets, SPECmarks, and 15-year
amortization schedules [8].




full-trace simulation data for a class of 
applications such as SPEC 92 integer 
benchmarks. 

Table 3 compares mean IPCs, 
number of branches, source of delays, 
and instruction frequency mixes for 
the sampled and continuous traces 
for the SPEC 92 benchmark li. These 
numbers have been calculated for a 
continuous trace simulation using a 


finite cache and a one-million-instruction
sampled trace using
a statistical cache for the li
application. The characteristics listed in the
table affect the performance of the 
PowerPC microprocessors and are 
used to check the representativeness 
of the sampled traces. That is, not 
only the IPC of the sampled and the 
continuous traces should be very 



close, but their simulation behavior 
should be very similar as well. 
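One of the checks described here, comparing instruction frequency mixes, is easy to sketch. The functions below are an illustration under our own assumptions (a trace reduced to a list of mnemonics, differences reported as sampled minus continuous percentage points); they are not taken from the BRAT tool.

```python
from collections import Counter


def frequency_mix(mnemonics):
    """Map each mnemonic to its percentage of the trace."""
    counts = Counter(mnemonics)
    total = sum(counts.values())
    return {m: 100.0 * c / total for m, c in counts.items()}


def mix_differences(continuous, sampled, top=10):
    """Sampled-minus-continuous percentage points for the `top` most
    frequent mnemonics of the continuous trace."""
    full = frequency_mix(continuous)
    samp = frequency_mix(sampled)
    ranked = sorted(full, key=full.get, reverse=True)[:top]
    return {m: samp.get(m, 0.0) - full[m] for m in ranked}


# Toy traces: the sample under-represents bc and over-represents ai.
continuous = ["l"] * 4 + ["bc"] * 3 + ["st"] * 2 + ["ai"]
sampled = ["l", "l", "bc", "st", "ai"]
diffs = mix_differences(continuous, sampled)
# l: 40% -> 40%, bc: 30% -> 20%, st: 20% -> 20%, ai: 10% -> 20%
```

A sampled trace would be accepted only when every such difference is small, in addition to the mean IPC being close.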

Simulation Methodology 

A simulation tool called Basic RISC 
Architecture Timer (BRAT) was de- 
veloped to implement the trace- 
driven methodology for design trade- 
offs and performance estimation for 
the PowerPC microprocessors. BRAT 
is a trace-driven, parameterized, sim- 
ulation tool that has been used for 
performance modeling of the super- 
scalar-based PowerPC microproces- 
sors at the Somerset Design Center in 
Austin [5]. BRAT covers a large su- 
perscalar design space, allows quick 
turnaround for designers’ what-if 
analysis and provides appropriate 
accuracy. The BRAT tool has been 
used for performance modeling of 
the PowerPC 601, PowerPC 603, 
PowerPC 604, and PowerPC 620 
microprocessors. 

BRAT provides several hundred 
parameters to model variations in the 
PowerPC superscalar microprocessor 
designs. These parameters are used 
to model: 


Figure 1. Sampled vs. continuous trace IPC for the SPEC 92 li benchmark.

This figure shows the sampled and continuous trace mean IPC vs. CPU
cycle time for the SPEC 92 li application. The SPEC 92 li continuous
trace contains nearly two billion instructions, whereas the li sampled
trace contains only one million instructions. The continuous trace
simulations used a finite cache model and the sampled trace
simulation used statistical cache miss ratios taken from the full trace cache
simulation. Notice that the continuous trace mean IPC reaches the
steady state very quickly and the sampled trace mean IPC is very close
to the continuous trace mean IPC. These characteristics have been
seen on most of the SPEC 92 benchmarks [6], making the sampling of
the SPEC 92 benchmarks very attractive for improving the
productivity of the performance modeling.



Figure 2. Instruction flow through a superscalar organization 


• Cache organization 

• Bus attributes 

• Execution unit pipeline 

• Reservation stations 

• Branch prediction schemes 

• Order of instruction issue 

• Order of instruction execution 

• Order of instruction completion 

• Register renaming 

• Data dependency 

• Register file read and write ports 

• Synchronization and serialization 

• Exceptions

• Others 


Superscalar architecture concepts
have been described in detail in [2].
Figure 2 shows a simplified data flow
in a superscalar organization.
Instructions are fetched from the
instruction cache (or memory), decoded,
dispatched, executed, and retired at the
CPU clock boundaries. Instructions
flow through the CPU pipeline
stages, wait in various buffers and
queues (depending on the hardware
organization and program
characteristics), spend some number of cycles
in the execution units (similar to
being served by servers in a queuing
system), and are retired through a
completion stage at some rate (leaving the
system). In this environment the
mean IPC of the processor can be
viewed as the throughput of the
queuing system.

BRAT allows modeling of a
reconfigurable queuing system that
correlates to various potential pipeline
organizations in the superscalar
design domain. In a simple queuing
system, such as a drive-in bank,
customers only wait if all the tellers are busy
serving other customers. In the
trace-driven simulation model, however,
instructions may wait at different
stages in the pipeline because of
resource conflicts, incorrect speculative
execution, data dependencies,
serialization, cache misses, and many other
reasons [2].
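The drive-in bank analogy can be sketched as a tiny queuing model in which an instruction waits only when every unit is busy, and throughput is completed instructions per cycle. This is the simple case the text contrasts with a real pipeline; it is our own toy model with made-up service times, not a description of BRAT's internals.

```python
import heapq


def throughput(service_cycles, num_units):
    """Completed instructions per cycle for an in-order arrival stream.

    Instruction i arrives at cycle i and occupies one of `num_units`
    identical units for service_cycles[i] cycles.
    """
    busy_until = [0] * num_units  # cycle at which each unit becomes free
    heapq.heapify(busy_until)
    finish = 0
    for arrival, service in enumerate(service_cycles):
        free_at = heapq.heappop(busy_until)  # earliest-free unit
        start = max(arrival, free_at)        # wait only if all units are busy
        done = start + service
        heapq.heappush(busy_until, done)
        finish = max(finish, done)
    return len(service_cycles) / finish if finish else 0.0


one_unit = throughput([2, 2, 2, 2], num_units=1)   # 4 instructions in 8 cycles
two_units = throughput([2, 2, 2, 2], num_units=2)  # 4 instructions in 5 cycles
```

A pipeline model differs from this sketch in exactly the ways the paragraph lists: an instruction can also stall on a data dependency, a mispredicted branch, a serialization rule, or a cache miss even when a unit is free.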

BRAT parameters can be classified
in two major groups. The first group
is parameters that are used to specify
the hardware structure. These
include the number of execution units,
the number of pipeline stages of the
floating-point execution unit, the size
of cache, the size of the dispatch
buffer, the instruction cache line size,
and others. The second group is
parameters that affect the flow of
instructions and stalls through the
pipeline. Some of these parameters
are order of execution, serialization
rules, register renaming,
branch-prediction scheme, completion
criteria, and arbitration in a unified
instruction and data cache.

BRAT accepts three inputs: 1) a 
dynamic instruction trace, represent- 
ing the workload; 2) a parameter file, 
representing the CPU model; and 
3) an architecture file, representing 
the supported instruction set. These 
inputs are specified either through 
the command line or through a 
graphical user interface. All the pa- 
rameters defined in the parameter 
file can be overridden through the 
command-line options. This
feature allows scripts to
be written that vary parameters so
that performance trade-offs can be
quickly explored.
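The override mechanism lends itself to a short sketch. The article does not give BRAT's parameter-file syntax or option names, so the `key = value` format, the `#` comments, and the parameter names `num_fxu` and `dcache_kb` below are all hypothetical; only the idea (file defaults, command-line style overrides, a scripted sweep) comes from the text.

```python
def parse_parameters(text):
    """Parse hypothetical 'key = value' lines; '#' starts a comment."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def apply_overrides(params, overrides):
    """Return a copy of params with 'key=value' overrides applied."""
    merged = dict(params)
    for item in overrides:
        key, value = item.split("=", 1)
        merged[key.strip()] = value.strip()
    return merged


def sweep(params, name, values):
    """One parameter set per value of the swept parameter."""
    return [apply_overrides(params, [f"{name}={v}"]) for v in values]


parameter_file = """
num_fxu = 1      # hypothetical: number of fixed-point units
dcache_kb = 8    # hypothetical: data cache size in KB
"""
base = parse_parameters(parameter_file)
runs = sweep(base, "dcache_kb", [8, 16, 32])  # three what-if configurations
```

Each element of `runs` would correspond to one simulation job, which is how a script can explore a trade-off without editing the parameter file.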


The BRAT tool was made available 
to the PowerPC design teams from 
the beginning of the design stage for 
performance analysis and design 
trade-offs. A network of RISC 
System/6000 machines was used for 
running BRAT jobs in the Somerset 
Design Center. Figure 3 shows the 
general flow of a BRAT simulation. 

As the detailed design of the 
PowerPC microprocessors evolved, it 
was very important to make sure the 
BRAT performance models contin- 
ued to correlate with the designs. 
Special performance test cases were 
developed to verify the BRAT model 
against the logic model for the 
PowerPC processor [5]. There were 
three categories of performance test 
cases: 1) single instructions, in which 
each test case had one instruction; 
2) instruction sequences with an in- 
struction frequency distribution simi- 
lar to those of the SPEC92 bench- 
mark programs, and 3) selected inner 
loops of 10 of the SPEC92 benchmark 
programs. The selected inner-loops 


Table 3. Comparison of the sampled and continuous performance data for li

Performance metrics                               Continuous trace  Sampled trace  Difference

Mean IPC (completed instruction per CPU cycle)    1.0000            0.9944         -0.0056
Number of taken branches                          1.0000            1.0183         +0.0183
Delays due to lack of resources                   1.0000            1.0000         +0.0000
Delays due to incorrect branch prediction         1.0000            1.0714         +0.0714
Delays due to data dependency                     1.0000            1.0000         +0.0000

Percentage of top 10 instruction
frequencies in the trace

Load (L)                                          22.33             22.42          +0.09
Conditional Branch (BC)                           14.34             14.36          +0.02
Store (ST)                                         9.95              9.69          -0.26
Compare Immediate (CMPI)                           8.90              8.76          -0.14
Add Immediate (AI)                                 7.33              7.42          +0.09
OR Immediate (ORIL)                                6.39              6.39          +0.00
Branch (B)                                         5.05              5.09          +0.04
Compare Logical Immediate (CMPLI)                  3.25              3.22          -0.03
Compare Logical (CMPL)                             2.63              2.67          +0.04
Load Byte and Zero (LBZ)                           2.59              2.49          -0.10




test cases proved most valuable, since 
they were actual benchmark code 
segments and uncovered design bugs 
related to performance. 

The average IPC difference between
BRAT simulations and logic
model simulations when running the 
test cases listed previously was less 
than 5%. The difference was satisfac- 
torily attributed to some cycle charac- 
teristics that could not be modeled 
exactly in the trace-driven simula- 
tion, such as predicted instruction 
paths not being taken, load/store 
misses of the wrong path, and data- 
dependent latencies. 

Performance Estimation of 
Standard Benchmarks 

SPEC 92 and Transaction Processing
Council (TPC) applications are used 
as standard benchmarks by many 
computer manufacturers to measure 
the performance of their systems. 


The SPEC 92 benchmarks consist of 6 
integer and 14 floating-point applica- 
tions [8]. TPC-A and TPC-B were the 
first two benchmark standards pro- 
vided by Transaction Processing 
Council (TPC). TPC is the de facto 
industry standards group for on-line 
transaction processing (OLTP) per- 
formance benchmarks. TPC-A is a 


superset of TPC-B. TPC-A can be 
described in terms of a hypothetical 
bank, which has one or more 
branches. Each branch has multiple 
tellers and many customers, each 
with an account. 

SPEC 92 and TPC traces were used 
in the Somerset Design Center for 
performance estimation and design 



Figure 3. BRAT simulation environment 



[Screen image not reproduced. Recoverable labels: menu buttons File,
Run, Skip, Step, and Options; a cycle counter (Cycle: 337); run
controls Number of Cycles, Upto Cycle Number, Number of Instructions,
Inst Number, Inst Address, Opcodes, Mnemonic, Exec Resource, Exec Unit,
Entry Point, Auto Run/Stop, and Set Auto Interval; and pipeline panels
Dispatch (Priority: Left to Right), GP Renames, CR Renames, Branch
Queue, FPU, BRU, LD/ST, FXU, CR MOD Queue, CR OP Queue, Load Queue,
Finish St. Queue, and Complete St. Queue.]

Figure 4. BRAT graphical user interface









Figure 5. The pipeline, cache traffic, and log report screen samples

Pipeline Report

Cycle  Inst#  IB
0      0      yz
1      1      zAB
2      3      BCD
3      4      CDEF
4      4      CDEFGH

[The remaining columns of the pipeline report (X, FPU, LSU, B, C, Fi,
Co) are not recoverable from the extracted text.]

(IB = Instruction buffer, X = Fixed-point unit, FPU = Floating-point unit,
LSU = Load/store unit, B = Branch unit, C = CR unit, Fi = Finished, Co = Completed)

Label  Inst. Address  Opcode    Data Addr.  Mnemonic
y      0x00000200     82420000  210c1c10    l    r18,0x0(r2)
z      0x00000204     80f20008  210a5208    l    r7,0x8(r18)
A      0x00000208     39000000              lil  r8,0x0
B      0x0000020c     91070000  210c80a8    st   r8,0x0(r7)
C      0x00000210     81320000  210a5200    l    r9,0x0(r18)
D      0x00000214     90690000  210c80a0    st   r3,0x0(r9)
E      0x00000218     81320004  210a5204    l    r9,0x4(r18)

Cache Traffic Report

Type  Cycle no.  R/S  Instruction number  Access type  Address   Penalty
i     0          +                        fetch        00000200  0
i     1          +                        fetch        00000208  0
i     2          +                        fetch        00000210  0
d                +    0                   load         210c1c10  0
i     3          +                        fetch        00000218  0
i     4          +                        fetch        00000220  0
d                +    1                   load         210a5208  0

[The per-access status columns (instr, data, CB) are not reproduced;
in the extracted text every fetch and load shown is marked h.]

Symbol:
i=instruction, d=data
+=request/serviced, r=request, s=serviced
h=hit, m=miss, b=busy, x=hold, $=forwarding data available, *=in service

Log Report

total cycles: 100
total inst completed: 64
Completed instructions per cycle (IPC): 0.64

* Dispatch Statistics

% cycles dispatch occurred: 51
% stall due to out of resources: 30
% stall due to mispred branches: 15

* Branch Statistics

total brs: 13
total guessed brs: 2
total guess right brs: 1
total guess wrong brs: 1

* Cache Access Statistics

* Completion Statistics

* Utilization of the Units

* System Parameters





trade-offs of the PowerPC
microprocessors. One-million-instruction
sampled traces (200 samples of 5,000
consecutive instructions), uniformly
sampled over the entire application,
turned out to be sufficient as well as
efficient for performance estimation
for each SPEC 92 benchmark [6].
Sampling of the TPC traces was not a 
trivial task, due to the nature of their 
cache memory accesses in which 
cache coldness effects at the start of 
the samples were significant. This 
resulted in unacceptable deviation of 
the sampled traces from the IPC of 
the full trace. Therefore, full traces 
were used for performance estima- 
tion and design trade-offs. 
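The SPEC 92 sampling scheme (200 samples of 5,000 instructions spread uniformly over the run) can be sketched as follows. The text does not say exactly where each sample starts, so placing sample i at i times an even stride is our assumption, as are the function names.

```python
def uniform_sample_starts(trace_length, num_samples=200, sample_size=5000):
    """Start offsets of evenly spaced, non-overlapping samples."""
    if trace_length < num_samples * sample_size:
        raise ValueError("trace is shorter than the requested sampled trace")
    stride = trace_length // num_samples  # distance between sample starts
    return [i * stride for i in range(num_samples)]


def sample_trace(trace, num_samples=200, sample_size=5000):
    """Concatenate the uniformly spaced samples into one sampled trace."""
    sampled = []
    for start in uniform_sample_starts(len(trace), num_samples, sample_size):
        sampled.extend(trace[start:start + sample_size])
    return sampled


# Offsets for the compress trace length listed in Table 2.
starts = uniform_sample_starts(80_537_933)
sampled_length = len(starts) * 5000  # 1,000,000 instructions

# Tiny demonstration: 4 samples of 5 from a 100-entry trace.
small = sample_trace(list(range(100)), num_samples=4, sample_size=5)
```

Each sample begins with whatever cache state the model carries over from the previous sample rather than the state the full run would have reached, which is the cache coldness effect the text reports as the obstacle to sampling the TPC traces.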

In order to further accelerate the 
process of the performance estima- 
tion, in addition to using the sampled 
traces, a distributed network of high- 
performance IBM RISC System/6000 
servers was used for performing sim- 
ulation jobs in the Somerset Design 
Center. Such an environment re- 
sulted in a new estimate of the pro- 
cessor performance for the designer’s 
what-if questions using SPEC 92 
benchmarks in less than 15 minutes. 
Without sampling and distributed
computing, this time would
increase to days.

The BRAT tool was documented 
and made available to all the Somer- 
set design team members from the 
beginning of the design cycle of the 
PowerPC microprocessors. More
than 40 design engineers have used
this tool at least once to assess the
performance implications of a
required design change. BRAT
will be enhanced and used for the 
performance analysis and design 
trade-offs of the future projects in the 
Somerset Design Center. 

Other Applications 

BRAT has also been used by other 
IBM PowerPC-based development 
organizations for cache and bus de- 
sign trade-offs, and by the compiler 
optimization group to study the na- 
ture of the instruction sequences gen- 
erated by the compiler. Cycle-by- 
cycle states of the CPU, as BRAT ana- 
lyzes the instruction trace, can be 
visualized either on a graphical user 
interface or encoded in an ASCII file. 
This information can assist the com- 
piler writer or system software devel- 


oper in optimizing code for a target 
PowerPC implementation. Using the 
BRAT graphical interface, one can 
look at the state of the hardware ele- 
ments at any cycle. Figure 4 shows a 
sample screen from the BRAT user 
interface. Users can set break points 
and conditionally stop the CPU exe- 
cution when certain conditions occur. 
Some of these features are listed as 
follows: 

• Watch the state of the machine 
cycle by cycle 

• Run and stop at cycle number n 

• Run and stop at instruction num- 
ber n 

• Run and stop at a specific address 

• Run and stop at a specific instruc- 
tion 

• Run and stop at a specific entry 
point by name 

• Backtrack n cycles 

• Set a bookmark at an address

• Return to an address by a bookmark

• Others 

In addition to the cycle-by-cycle 
state information displayed, BRAT 
produces several output files that 
contain detailed information about 
the run. The log report contains IPC,
cumulative branch statistics,
cache hit-and-miss ratios,
load/store counts, sources of dispatch,
execution, and completion delays, the
instruction frequency mix, and the utilization of
all the resources in the system. The
CPU pipeline report contains positional 
information of the instructions in the 
pipeline on a cycle-by-cycle basis. The 
cache traffic report contains cycle-by- 
cycle state information for cache re- 
quests. Figure 5 shows a sample 
screen output of the log report, pipeline 
report, and cache traffic report. 

BRAT was also used in Somerset 
for the performance verification of 
the high-level logic model. The func- 
tionality of the high-level logic model 
was verified through billions of cycles 
of simulation. The cycle simulation in 
Somerset does an excellent job of cov- 
ering functionality of the design, but 
it cannot easily detect the extra delays 
and bubbles in the design that affect 
performance. There can be situations 
in which the logic model is function- 
ally correct but causes the instruc- 
tions to wait extra cycles in the pipe- 
line. BRAT was used to discover these 


kinds of errors. In this process, the 
BRAT model and the logic model 
were compared on a cycle-by-cycle 
basis for performance-critical test 
cases. This was also used to verify the 
BRAT model as mentioned earlier. 
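A cycle-by-cycle comparison of this kind can be sketched as a diff of per-instruction completion cycles. The data layout (a mapping from instruction number to completion cycle for each model) and the function name are our assumptions for illustration; the article does not describe the actual comparison tooling.

```python
def find_extra_delays(brat_completion, logic_completion, tolerance=0):
    """List (instruction number, extra cycles) where the logic model
    completes an instruction later than the performance model predicts.

    Both arguments map instruction number -> completion cycle.
    """
    report = []
    for inst in sorted(brat_completion):
        if inst not in logic_completion:
            continue  # instruction not present in the logic-model run
        extra = logic_completion[inst] - brat_completion[inst]
        if extra > tolerance:
            report.append((inst, extra))
    return report


# Hypothetical test case: the logic model inserts a 2-cycle bubble
# before instruction 2, which then delays instruction 3 as well.
brat = {0: 3, 1: 4, 2: 5, 3: 6}
logic = {0: 3, 1: 4, 2: 7, 3: 8}
delays = find_extra_delays(brat, logic)  # [(2, 2), (3, 2)]
```

The first instruction in the report points at the cycle where the bubble entered the pipeline. A mismatch can equally mean that the performance model is wrong, which is why the same comparison served to verify BRAT itself.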

An integrated version of BRAT is 
also available as the optional timer 
feature in the IBM PowerPC Visual 
Simulator (PVS). System designers 
can use this tool to study the memory 
organization of a proposed PowerPC- 
based system. PVS provides a 
PowerPC Architecture simulator to 
drive the timing simulator. Program 
executables are executed using the 
architecture simulator and the result- 
ing program traces are fed to the tim- 
ing simulator for performance analy- 
sis [7]. 

Summary 

The PowerPC performance modeling 
was based on a trace-driven simula- 
tion approach. BRAT, a parameter- 
ized, trace-driven, and superscalar- 
based simulation tool was developed 
to define various organizations of the 
PowerPC chips. Sampled traces were 
used to reduce the cost and the turn- 
around time of the simulation. The
representativeness of the sampled traces was
verified by comparing the outputs of
the sampled and continuous trace
simulations. The performance models
were validated against the logic mod- 
els throughout the design stage. Per- 
formance modeling tools were used 
by the design engineers for doing 
design trade-offs. 

To enhance the quality of the cur- 
rent performance methodology, a 
combination of trace-driven and exe- 
cution-based models is desirable for 
more accurate performance model- 
ing. Design teams should work more 
closely with the compiler developers, 
operating system developers, and sys- 
tem designers from the beginning of 
the design cycle to achieve better sys- 
tem performance. 

Acknowledgments 

The author would like to thank Mike
Peters for his assistance during the
development of the BRAT tool,
S. Surya for generating the SPEC 92
traces and doing cache simulation,
Yanling Qi for BRAT support, and
Robert Rosenblum and John Bunda for
their assessment of an earlier version
of this article. Many thanks to Don
Waldecker, Chin-Cheng Kau, David
itan, Gary Veneski, Ray Dupont,
Peter Song, Deene Ogden, Sonya
Gary, Brad Burgess, Harry Dwyer,
Steve Ewedemi, and David Lee for
their feedback and support during
the project. □

References 

1. Hennessy, J., and Patterson, D. Computer
Architecture: A Quantitative Approach.
Morgan Kaufmann, San Mateo,
Calif., 1990.

2. Johnson, M. Superscalar Microprocessor
Design. Prentice-Hall, Englewood
Cliffs, N.J., 1991.

3. Laha, S., Patel, J., and Iyer, R. Accurate
low-cost methods for performance
evaluation of cache memory systems.
IEEE Trans. Comp. 37, 11 (1988), 1325-1336.

4. Liu, L., and Peir, J. Cache sampling by
sets. IEEE Trans. VLSI Syst. 1, 2 (1993),
98-105.

5. Poursepanj, A., et al. The PowerPC 603
microprocessor: Performance analysis
and design trade-offs. In Proceedings of
COMPCON 1994 (Feb. 1994).

6. Poursepanj, A., and Wu, C. Trace
sampling for design trade-offs of
microprocessors using SPEC 92 Integer
benchmarks. IBM Tech. Rep.

7. The PowerPC Visual Simulator for
RISC System/6000, IBM Corporation.

8. SPEC 92 Release notes, System
Performance Evaluation Corporation.

9. Stunkel, C., Janssens, B., and
Fuchs, W.K. Address tracing for parallel
machines. IEEE Comput. (Jan. 1991),
48-54.

About the Author: 

ALI POURSEPANJ is an advisory
engineer/scientist and team leader of the
PowerPC microprocessors group at IBM’s
Somerset Design Center. Current re- 
search interests include computer archi- 
tecture, performance modeling, and trace 
sampling. Author’s Present Address: 
IBM Somerset Design Center, 9737 Great 
Hills Trail, Austin, TX 78759; email: 
ali@ausvm6.vnet.ibm.com 

IBM, PowerPC, PowerPC 603, PowerPC 604, 
PowerPC 620, PowerPC Architecture, and RISC 
System/6000 are trademarks of International 
Business Machines Corporation. SPEC is a 
trademark of System Performance Evaluation 
Corporation. 

Permission to copy without fee all or parts of this 
material is granted provided that the copies are 
not made or distributed for direct commercial 
advantage, the ACM copyright notice and the 
title of the publication and its date appear, and 
notice is given that copying is by permission of 
the Association for Computing Machinery. To 
copy otherwise, or to republish, requires a fee 
and/or specific permission. 

© ACM 0002-0782/94/0600 $3.50 




COMMUNICATIONS OF THE ACM June 1994/Vol.37, No.6 55 





A Modular Approach to Motorola 
PowerPC Compilers 

The need for balance between software and hardware is a well-known principle of RISC 
microprocessor design methodologies. In order to achieve a high level of performance, 
RISC microprocessors are designed to allow compilers to take full advantage of the 
pipelines and resources available. The PowerPC family of microprocessors is being 
designed to be used for many purposes, ranging from low-power embedded controllers to 
powerful, supercomputer-class multiprocessor systems. This diversity of uses will lead to an 
equally diverse set of operating system environments, including AIX, Macintosh OS, 
Solaris and Windows/NT, among others. Despite the multitude of PowerPC processor and
system configurations being developed, there remains a need for highly optimizing compilers
that exploit both the base PowerPC Architecture and the specific implementations of
the chips and systems designed around the architecture.


Motorola has developed a highly 
optimizing, modular compilation 
environment that can be quickly 
adapted to various PowerPC micro- 
processor and system configurations. 
This suite of C, C++ and Fortran 
compilers is designed to meet the fol- 
lowing criteria: 

• Highly optimizing, ensuring opti- 
mal performance for PowerPC micro- 
processors 

• Highly retargetable, ensuring 
rapid time-to-market 

• Highly configurable, supporting 
multiple object file and debugging 
formats 

• Compliant with software standards,
ensuring portability of code between
chips


This article describes the modular
structure of the compiler and the data
and information flow through its
major phases, and discusses the
architecture-specific and
implementation-specific optimizations
currently performed in the PowerPC
compilers. The examples and
descriptions show how the
Motorola PowerPC compilers are
designed to provide the high
performance and diversity that are
essential to the PowerPC Architecture.


Motorola Compiler Structure 

The Motorola compilation system, 
based partially on technology ac- 
quired from Apogee Software for the 
88000 architecture, consists of a se- 
ries of components that collectively 
can provide highly optimized 
PowerPC microprocessor code for a 
wide range of source languages, ob- 
ject file and debugging formats, and 
system environments. Conceptually, 
the heart of the Motorola compilation 
system is a common core that inte- 
grates multiple front ends with 
target-specific code generators to 
provide consistently high perfor- 
mance across an extremely wide 
range of target environments (see
Figure 1).

Figure 1. Overview of Motorola
compilation system

Multiple host platforms: RS/6000, Sun 4, x86 PC/NT, Macintosh
Source languages: ANSI C, ANSI C++, Microsoft C/C++, FORTRAN 77
Common core: Common Global Optimizer, Register Allocation,
Instruction Scheduling, Interprocedural Analyzer, Code Generation
Object file formats: XCOFF (AIX), ELF (SVR4/Solaris), MCOFF (Microsoft/NT)
Debugging formats: STAB (dbx/gdb), DWARF, CV (Microsoft/NT)
Targets: PowerPC 601/603/604/620, Embedded control
Byte ordering: Big endian, Little endian

Additionally, the Motorola compiler
is designed to maximize performance
for RISC applications by providing
three distinct internal
representations of the program: a
Tree representation, a RISC
representation and a Machine
representation. Collectively, these
three representations create a
foundation on which a series of
program transformations and
optimizations can be conducted. Each
of the three representations is
uniquely suited to a particular class
of program transformations, ensuring
that optimizations can be efficiently
applied and that transformations are
sequenced in a logical order.

Program Representations

The first of these representations is
an annotated parse tree, which
provides an abstract representation of
the program. This representation is
primarily independent of both the
original program source language
and the underlying target processor
implementation. All language front
ends used with the Motorola
compilation suite share this initial
representation, thereby enabling many
language implementations to leverage
the same optimizer and code
generator used for all Motorola RISC
microprocessors. The Tree representation
is best suited for those optimizations
and transformations that require
information about the programmatic
relationships of the high-level code.
For instance, array subscripts can be
accessed in a more direct manner in
the Tree representation than in the
RISC or Machine representations,
where program constructs are broken
down into actual memory references
consisting of a base address and
an offset. Such information is needed
for loop transformations that can
significantly enhance performance.

Another common aspect of
transformations performed on the Tree
representation is that such changes
tend to expose additional information
for later phases of the compiler.
For example, subroutine inlining or
interprocedural analysis can expose
details to the optimizer that would
not otherwise be available. This
analysis is accomplished by looking
across subroutines, and, in some
cases, file boundaries.

The Tree representation does not
treat the program as a sequence of
machine instructions, and, therefore,
does not yet divide the program into
basic blocks. With program control
flow still explicitly represented as tree
operators, the compiler can quickly
perform certain control flow
optimizations to minimize branch
instructions, an important consideration
for superscalar RISC microprocessors
such as PowerPC microprocessors.
Explicit control flow information
also simplifies the process of
transforming program loops, and
often helps identify loop induction
variables which must be carefully
analyzed as part of the loop
transformation process.

The second representation is a set
of instruction templates resembling a
generic RISC instruction set. The
primary purpose of this RISC
representation is to permit the global
optimizer to efficiently perform a series of
aggressive performance-enhancing
transformations without regard to
the details of the underlying RISC
microprocessor. Although it may
seem unusual to ignore the details of
the implementation when performing
optimizations, doing so permits
the optimizer to focus only on those
aspects of a machine model that are
relevant to RISC microprocessors.
For example, the RISC intermediate
representation assumes an unlimited
number of registers, a load/store 
memory architecture, and arbitrary- 
size literal values. While such as- 
sumptions represent only estimates, 
they are particularly useful consider- 
ing the design of RISC microproces- 
sors. For example, the assumption of 
infinite register resources is appro- 
priate when considering a modern 
compiler that is targeting the 
PowerPC Architecture, which has a 
total of 32 general-purpose registers 
and 32 floating-point registers. 

Conversely, this assumption would 
be inappropriate for a CISC architec- 
ture that has only a few registers 
available for general use by the com- 
piler. Additionally, permitting arith- 
metic operations to be applied di- 
rectly to operands located in memory 
may be suitable for a CISC architec- 
ture that supports such an addressing 
mode, but would clearly be inappro- 
priate for most RISC architectures, 
including the PowerPC Architecture. 
Another key attribute of the RISC in- 
termediate representation is that pro- 
gram control flow is now implicitly 
described by representing the RISC 
instruction templates as a series of 
basic blocks that can be connected ei- 
ther as a doubly linked sequential list 
or as a directed graph. Such a repre- 
sentation is often more appropriate 
than a tree structure for global opti- 
mizations, since most global optimiza- 
tions require extensive global data 
and control flow analysis. To contrast 
the applicability of the RISC interme- 
diate representation with that of the 
previous Tree representation, con- 
sider that most transformations 
performed on the Tree representa- 
tion are associated with program 
structure, whereas most RISC-level 
transformations are involved with 
instructions. 
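The step from the Tree representation to the RISC representation can be pictured with a small sketch. It is only an illustration under stated assumptions: the nested-tuple tree format, the opcode names (LOAD, LI, STORE, ADD, MUL) and the functions `lower` and `lower_assignment` are hypothetical and are not taken from the Motorola compiler. The sketch shows the three assumptions named above: a fresh virtual register for every value, memory touched only by loads and stores, and literals of any size.

```python
from itertools import count

def lower(tree, code, fresh):
    """Lower a nested-tuple expression tree to three-address templates.

    Leaves are variable names (str) or integer literals; interior nodes are
    (op, left, right). Every result gets a fresh virtual register, mirroring
    the "unlimited registers" assumption of the RISC representation.
    """
    if isinstance(tree, int):
        reg = f"v{next(fresh)}"
        code.append(("LI", reg, tree))      # arbitrary-size literal
        return reg
    if isinstance(tree, str):
        reg = f"v{next(fresh)}"
        code.append(("LOAD", reg, tree))    # memory is reached only by loads
        return reg
    op, left, right = tree
    a = lower(left, code, fresh)
    b = lower(right, code, fresh)
    reg = f"v{next(fresh)}"
    code.append((op, reg, a, b))            # register-to-register operation
    return reg

def lower_assignment(dest, tree):
    """Lower `dest = tree` and return the list of instruction templates."""
    code, fresh = [], count()
    result = lower(tree, code, fresh)
    code.append(("STORE", dest, result))
    return code

# a = b * c + 100000
templates = lower_assignment("a", ("ADD", ("MUL", "b", "c"), 100000))
for template in templates:
    print(template)
```

No register is ever reused here; mapping the virtual registers onto the 32 real ones is left to the later allocation phases.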

The third intermediate represen- 
tation used by the compiler is an 
annotated version of the RISC repre- 
sentation. This representation con- 
tains additional information about 
machine resources, including reg- 
isters and functional unit descrip- 
tions. Additionally, this representation 
is used by the compiler to per- 
form target-specific transformations, 
including register allocation, in- 
struction scheduling, and branch 
prediction. 


The Machine representation is cre- 
ated by expanding the instruction 
templates of the RISC representation 
into instructions or sequences of in- 
structions that have a 1:1 corre- 
spondence with the actual ma- 
chine instructions of the target micro- 
processor. In some cases, a single 
RISC instruction template may ex- 
pand to several PowerPC instruc- 
tions. For example, the PowerPC 
Architecture does not include an 
instruction to perform integer ab- 
solute value. However, the RISC in- 
termediate representation of the 
program treats “absolute value” 
as a single operation, in order to 
present the optimizer with a clear 
view of the intended program se- 
mantics. This operation is expanded 
into a series of Machine instruction 
templates, as shown in Figure 2a. 
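The arithmetic behind the srawi/xor/subfc expansion of Figure 2a can be checked with a short model. This is a sketch, not compiler code: the helper names `srawi`, `iabs` and `to_u32` are chosen for illustration, registers are modeled as 32-bit unsigned Python integers, and the carry side effect of subfc is ignored.

```python
MASK32 = 0xFFFFFFFF

def to_u32(n):
    """Wrap a Python integer into a 32-bit register image."""
    return n & MASK32

def srawi(rs, sh):
    """Shift right algebraic: the sign bit is replicated into vacated bits."""
    signed = rs - (1 << 32) if rs & 0x80000000 else rs
    return (signed >> sh) & MASK32

def iabs(rs):
    """Model of the three-instruction expansion of IABS rD,rS."""
    rx = srawi(rs, 31)         # srawi rX,rS,31 : 0 if rS >= 0, all ones if rS < 0
    ry = (rs ^ rx) & MASK32    # xor   rY,rS,rX : one's complement when negative
    rd = (ry - rx) & MASK32    # subfc rD,rX,rY : rD = rY - rX, adds 1 when negative
    return rd

print(iabs(to_u32(-5)), iabs(7))
```

As with any two's-complement absolute value, the most negative 32-bit input maps to itself.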

In other cases, multiple RISC in- 
struction templates are combined 
into a single machine instruction for 
the PowerPC Architecture. For in- 
stance, a pair of instructions such as a 
floating-point multiply operation that 
feeds its result to a floating-point add 
operation can be combined into a 
single “floating-point multiply-add” 
(fmadd) instruction by the compiler, 
as illustrated in Figure 2b. 
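A combination of this kind can be sketched as a small pattern match over a list of templates. The tuple layout and the function name `fuse_multiply_add` are assumptions made for this example. The sketch fuses only an FMUL that is immediately followed by an FADD reading its result, and only when no other instruction reads the temporary; a production compiler would use proper data flow information and would also weigh the different rounding behavior of a fused operation.

```python
def fuse_multiply_add(code):
    """Combine FMUL t,a,b followed directly by FADD d,t,c into FMADD d,a,b,c.

    Instructions are tuples (opcode, dest, src1, src2). The pair is fused
    only when the temporary t appears once in the FADD and is read nowhere
    else, so removing it cannot change what the program computes.
    """
    out, i = [], 0
    while i < len(code):
        ins = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        if ins[0] == "FMUL" and nxt and nxt[0] == "FADD":
            t, a, b = ins[1], ins[2], ins[3]
            srcs = list(nxt[2:])
            rest = code[:i] + code[i + 2:]
            read_elsewhere = any(t in other[2:] for other in rest)
            if srcs.count(t) == 1 and not read_elsewhere:
                srcs.remove(t)
                out.append(("FMADD", nxt[1], a, b, srcs[0]))
                i += 2
                continue
        out.append(ins)
        i += 1
    return out

pair = [("FMUL", "rT", "rS1", "rS2"), ("FADD", "rD", "rT", "rS3")]
print(fuse_multiply_add(pair))
```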

In order to maximize the retar- 
getability of the compiler with respect 
to both the compilation environment 
and processor performance, the Ma- 
chine representation used by the 
compiler is largely table-driven. Such 
an approach permits common algo- 
rithms and data structures to be 
shared for all environments while still 
permitting the code to be optimized 
for a particular target implementa- 
tion. The target-specific data tables 
contain a great deal of information 
regarding the underlying micropro- 
cessor. The information typically 
addresses three target attributes: 1) 
the programming environment, 2) 
the architecture, and 3) the microar- 
chitectural implementation. 

The programming environment. 
The first type of information con- 
veyed by the machine description 
tables involves the target program- 
ming and execution environment. 
The Motorola compilation system can 
support a number of object file for- 
mats, such as XCOFF and ELF, as 
well as several debugging formats, 


including STAB (dbx), DWARF, and 
CodeView (NT). Each of these for- 
mats requires compiler support of 
some kind, and often requires differ- 
ent subroutine calling sequences or 
data layouts. For example, the Moto- 
rola NT compiler must target a little- 
endian environment that requires 
CodeView symbolic debugging infor- 
mation, whereas the Motorola C and 
Fortran compilers for AIX systems 
use big-endian, XCOFF conventions 
and a STAB-style debugging format. 
By describing such information in an
abstract, table-driven manner, the
task of retargeting the compilers to a
different combination of programming
environment attributes is
greatly simplified, reducing
time to market for PowerPC compilers
and, in turn, for PowerPC system and
application developers.

The architecture. Architectural 
considerations in the machine de- 
scription include the size and number 
of machine registers, a grammatical 
description of the instruction set and 
addressing modes, and a list of literal 
operand sizes supported. These attri- 
butes are shared among all PowerPC 
microprocessor family members, and 
therefore do not need to be changed 
as often in the compiler. 

The microarchitectural implemen- 
tation. Microarchitectural attributes 
addressed by the machine descrip- 
tion include a detailed three-level 
description of all machine resources 
for a particular PowerPC implemen- 
tation. This description is used exten- 
sively by the instruction scheduler to 
determine the optimal code sequence 
for a particular application on a spe- 
cific PowerPC implementation. 

The first level of microarchitec- 
tural description is a set of resource 
definitions. A resource can be virtu- 
ally any aspect of the microarchitec- 
ture that is desirable to take into ac- 
count when performing instruction 
scheduling. Examples include a func- 
tional unit pipeline stage, an instruc- 
tion fetch or dispatch slot, or a read 
or write port to a register file. Re- 
sources can be singularly or multiply 
defined, allowing a fairly arbitrary set 
of RISC microprocessor attributes to 
be created as building blocks for a 
higher-level machine description. 




For instance, the number of instruc- 
tion issue slots differs between the 
603 and 604 microprocessors. Conse- 
quently, this fact can be accurately 
described in the corresponding com- 
piler machine description tables. 

The second level of microarchitec- 
tural description is formed by com- 
bining the machine resources into a 
series of arbitrary functional units. 
Associated with these functional units 
are timing templates that, when com- 
bined with the combinatorial flexibil- 
ity of the functional unit descriptions, 
enable the compiler to accurately 
portray the multiple pipelined func- 
tional units of the superscalar 
PowerPC microprocessors. This ap- 
proach is conducive to rapid proto- 
typing of compilers for future 
PowerPC implementations, since it is 
relatively straightforward to add or 
replicate a functional unit and evalu- 
ate the performance effects caused by 
such a change. An effort is under way 
to integrate the machine description 
used by the Motorola compiler with 
simulators being developed by Moto- 
rola [2] and Carnegie Mellon Univer- 
sity [1], allowing a single change to 
the microarchitectural description of 
a PowerPC microprocessor to gener- 
ate a new simulation model as well as 
new code that is tuned for that 
model. Such integration would allow 
for making more accurate and timely 
microprocessor design decisions. 

The third level of the microarchi- 
tectural description used in the Moto- 
rola compilers involves the map- 
ping of each machine instruction to 
one or more functional units. Such a 
mapping creates the link needed 
between the Machine representation 
of the code and the actual timing 
information required for accurate 
scheduling. 
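The three levels can be pictured as three small tables. Everything in the sketch below is hypothetical: the target names `two_issue` and `four_issue`, the resource names, the counts and the latencies are invented for illustration and do not describe any actual PowerPC implementation or the format of the Motorola tables. The point is only that changing a count in the first table changes what the scheduler may pair, without touching the scheduling code.

```python
from collections import Counter

# Level 1: resource definitions, with how many copies each target has.
RESOURCES = {
    "two_issue":  {"dispatch_slot": 2, "integer_stage": 1,
                   "load_store_stage": 1, "float_stage": 1},
    "four_issue": {"dispatch_slot": 4, "integer_stage": 2,
                   "load_store_stage": 1, "float_stage": 1},
}

# Level 2: functional units built from resources, with a timing template.
UNITS = {
    "IU":  {"uses": ("dispatch_slot", "integer_stage"), "latency": 1},
    "LSU": {"uses": ("dispatch_slot", "load_store_stage"), "latency": 2},
    "FPU": {"uses": ("dispatch_slot", "float_stage"), "latency": 3},
}

# Level 3: mapping of each machine instruction to a functional unit.
INSTRUCTION_UNIT = {"addi": "IU", "cmpi": "IU", "lfs": "LSU", "fmuls": "FPU"}

def can_issue_together(target, mnemonics):
    """True if `target` has enough copies of every resource the group needs."""
    needed = Counter()
    for mnemonic in mnemonics:
        needed.update(UNITS[INSTRUCTION_UNIT[mnemonic]]["uses"])
    available = RESOURCES[target]
    return all(available.get(name, 0) >= n for name, n in needed.items())

def result_ready(mnemonic, issue_cycle):
    """Cycle at which the result is available, per the timing template."""
    return issue_cycle + UNITS[INSTRUCTION_UNIT[mnemonic]]["latency"]

print(can_issue_together("two_issue", ["addi", "cmpi"]),
      can_issue_together("four_issue", ["addi", "cmpi"]))
```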

Phases of Compilation 

The Motorola compilers are imple- 
mented as a series of interrelated 
phases that ultimately transform a 


user’s high-level source program into 
highly optimized PowerPC assembly 
code which can then be assembled 
and linked to form an executable 
program. Figure 3 outlines some of 
the major phases of the Motorola 
compilers. Traditionally, compilers 
have often been described in terms of 
a “front end” and a “back end.” For 
the Motorola compiler, the front end 
makes use of the Tree intermediate 
representation; the back end, which 
is typically shared by multiple front 
ends, makes use of both the RISC and 
Machine representations. 

In the compiler “front end,” 
source language programs are parsed 
and transformed into the Tree inter- 
mediate representation after appro- 
priate syntax and semantic checking 
takes place. In some cases, such as the
Motorola NT compiler, an external
language interface similar to the
Microsoft C/C++ front end is used in
place of the standard Motorola front
end. Such integration is desirable for
market segments where millions of
lines of code already exist, or
where system-specific language
extensions are needed. Through the
integration of the Microsoft front end
with the Motorola back end for the
NT compiler, existing Microsoft
application code and the associated
build structures (i.e., Makefiles) can
be quickly ported to PowerPC
systems with little or no source changes
while still picking up the benefits of
the highly optimizing Motorola back
end.

Optimizations performed during 
this phase of compilation include sub- 
routine inlining, interprocedural 
analysis and a series of loop transfor- 
mations, such as software pipelining 
and loop unrolling. These program 
transformations all share the com- 
mon goal of improving performance 
by exposing more information to 
later phases of the compiler. 

On completion of front-end pro- 
cessing, the Tree representation is 


converted into the RISC representa- 
tion, which is used by the global 
optimizer. A substantial number of 
complex global optimizations are per- 
formed during the next phase of op- 
timization. A series of data flow and 
control flow analyses are carried out 
to determine the applicability and de- 
sirability of individual optimizations. 

Most optimizations, including 
common subexpression elimination, 
constant folding and propagation, 
loop invariant removal and dead 
code elimination, are performed as a 
series of passes over the RISC repre- 
sentation of the program. Due to the 
change in program state caused by 
some transformations, certain optimi- 
zations are applied multiple times 
throughout this phase of compilation. 
Much of the information gathered 
during the optimization phase can 
also be used to globally allocate 
registers to program values that are 
“live” across multiple basic blocks. 
The global register allocation (GRA) 
phase also provides an estimate of 
“register pressure” which is used sub- 
sequently by the instruction schedul- 
ing heuristics. 
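The idea of running passes repeatedly until the program stops changing can be shown on a single basic block. The sketch below is a simplified illustration, not the Motorola optimizer: the tuple format and the function names are assumptions, only ADD and MUL are folded, and liveness is handled for straight-line code only.

```python
def fold_and_propagate(code):
    """Forward pass: substitute known constants and fold constant ADD/MUL."""
    consts, out = {}, []
    for op, dest, *srcs in code:
        srcs = [consts.get(s, s) for s in srcs]
        if op == "LI":
            consts[dest] = srcs[0]
        elif op in ("ADD", "MUL") and all(isinstance(s, int) for s in srcs):
            value = srcs[0] + srcs[1] if op == "ADD" else srcs[0] * srcs[1]
            op, srcs = "LI", [value]
            consts[dest] = value
        else:
            consts.pop(dest, None)      # dest no longer holds a known constant
        out.append((op, dest, *srcs))
    return out

def eliminate_dead(code, live_out):
    """Backward pass: drop instructions whose result is never read."""
    live, kept = set(live_out), []
    for op, dest, *srcs in reversed(code):
        if dest in live:
            live.discard(dest)
            live.update(s for s in srcs if isinstance(s, str))
            kept.append((op, dest, *srcs))
    return kept[::-1]

def optimize(code, live_out):
    """Repeat both passes until a full round leaves the code unchanged."""
    while True:
        new = eliminate_dead(fold_and_propagate(code), live_out)
        if new == code:
            return code
        code = new

block = [
    ("LI", "v0", 2),
    ("LI", "v1", 3),
    ("MUL", "v2", "v0", "v1"),
    ("ADD", "v3", "v2", "x"),
    ("ADD", "v4", "v0", "v1"),
]
print(optimize(block, live_out={"v3"}))
```

Folding turns the multiply into a constant, which makes the two literal loads dead; the final round confirms that nothing further changes.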

Once the global register allocation 
phase is complete, the RISC repre- 
sentation is expanded into the Ma- 
chine representation, which is used 
for the instruction scheduling and 
local register allocation (LRA) phases 
of compilation. Instruction schedul- 
ing is applied in two stages. First, in- 
structions are globally moved across 
basic block boundaries to hide in- 
struction latencies that may not be 
possible to hide within the original 
basic block. Such code motion is par- 
ticularly valuable for “compare” in- 
structions in the PowerPC Architec- 
ture, since there is often a significant 
delay between the generation of a 
condition code and its use in some 
PowerPC implementations. After 
global scheduling is complete, a local 
scheduling pass is implemented 
within each basic block. It is this
phase of compilation that exploits the
superscalar nature of the PowerPC
microprocessors.

Figure 2. Mapping RISC representation
to machine representation

a. Instruction expansion from RISC to machine representations

IABS rD,rS         becomes:   srawi rX,rS,31
                              xor   rY,rS,rX
                              subfc rD,rX,rY

b. Instruction combination from RISC to machine representations

FMUL rT,rS1,rS2    becomes:   fmadd rD,rS1,rS2,rS3
FADD rD,rT,rS3

Figure 3. Compilation phases

Phase sequence: Parser, Inlining/IPA, Loop transformations,
Global optimizations, GRA, Scheduling, Peephole, Code Generation

Tree representation

• Parsing, syntax checking

• Subroutine inlining

• Loop transformations

• Interprocedural analysis

RISC representation

• Alias analysis

• Global optimizations

• Global register allocation

Machine representation

• Global Instruction Movement

• Instruction scheduling

• Branch prediction

• Local register allocation

• Peephole optimizations

• Code generation

Figure 4. Scheduling difference
between a) 601 and b) 603 compilers

a) 601 schedule               b) 603 schedule
lfs   f1,0x0(r4)              lfs   f1,0x0(r4)
lfs   f2,0x0(r3)              addi  r5,r5,0x1
addi  r5,r5,0x1               lfs   f2,0x0(r3)
fmuls f1,f1,f2                cmpi  0x6,0x0,r5,0x1e
cmpi  0x6,0x0,r5,0x1e         fmuls f1,f1,f2
stfsu f1,0x4(r4)              stfsu f1,0x4(r4)
bc    0xc,0x18,L..2           bc    0xc,0x18,L..2

Currently, the Motorola compilers 
can be directed by command line 
flags to schedule code for any of the 
601, 603, 604 and 620 implementa- 
tions. This process often leads to sig- 
nificant performance improvements, 
while still maintaining instruction set 
compatibility across all family mem- 
bers. Local register allocation follows 


instruction scheduling for each basic 
block. Although there are trade-offs 
between the relative ordering of the 
scheduling and register allocation 
phases, the integration of scheduling 
between GRA and LRA helps mini- 
mize the potential difference in the 
resulting code. Finally, after a brief 
“peephole” optimization phase, as- 
sembly code is generated by the com- 
piler and the compilation process is 
completed. 

Compilation for PowerPC 

The PowerPC Architecture is a full
64-bit RISC architecture that was
derived from IBM’s existing
POWER architecture [4]. In defining
the PowerPC Architecture, the
POWER architecture was simplified
in several ways to remove aspects that
might prohibit aggressive superscalar
and high clock rate implementations.
Several aspects of the PowerPC
Architecture distinguish it from
other RISC architectures and
affect the compiler.
These include the instruction set,
special-purpose registers, the branch
model and the memory model.

Instruction set. The PowerPC
instruction set offers several key
instructions that compilers can use to
improve performance. One group
comprises the path-length
reduction operations such as
“floating-point multiply-add” and
“load/store with update.” A second
group is a powerful
set of bit-field operations that can
sometimes be used to replace
comparatively expensive branch
instructions. Finally, special instructions are
also included to perform logical
operations on condition code results, once
again leading to a decrease in branch
instructions.

Special-purpose registers. The 
PowerPC Architecture has several 
special-purpose registers used by the 
compiler, including the Condition 
Code register, the Link register, and 
the Count register. The Condition 
Code register has eight fields that can 
hold the results of comparisons. 
These fields are usually set by com- 
pare instructions, or, optionally, by 
the record form of other instructions. 




Conditional branch instructions most
commonly use the condition code
fields, although condition codes can
also be moved to or from the
general-purpose registers.

The Link register is used for sub- 
routine linkage, as its name suggests. 
The Count register is used in several 
ways, though primarily to hold the 
index value in loops. The PowerPC 
Architecture provides a branch
conditional form which decrements the
value in the Count register, compares 
the result to zero, and then branches 
based on the result of the compari- 
son. The compiler can make use of 
such a mechanism to eliminate both a 
decrement operation and a compare 
instruction. 
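The behavior of such a counted loop can be modeled in a few lines. This is a sketch of the semantics only, with hypothetical names (`run_count_register_loop`, `body`); it assumes a trip count of at least one, since the test sits at the bottom of the loop.

```python
def run_count_register_loop(trip_count, body):
    """Model a bottom-tested loop closed by a decrement-and-branch.

    The Count register is loaded once before the loop. Each pass then ends
    with a single step that decrements it, compares the result with zero,
    and branches back while it is nonzero, so the loop needs no separate
    index update or compare instruction.
    """
    if trip_count < 1:
        raise ValueError("a bottom-tested loop runs its body at least once")
    ctr = trip_count          # load the Count register
    while True:
        body()
        ctr -= 1              # decrement ...
        if ctr == 0:          # ... compare with zero ...
            break             # ... and fall through instead of branching back

calls = []
run_count_register_loop(30, lambda: calls.append(1))
print(len(calls))
```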

A register sometimes reserved 
through software convention, the 
table of contents (TOC) register, is 
not part of the architecture, but it is 
defined by the PowerOpen Applica- 
tion Binary Interface [3] as part of the 
convention for linking. The TOC is a 
common storage area used to look up 
all address constants and external 
symbols for a given object. 

Branch model. The conceptual 
model of the PowerPC Architecture 
defines the branch unit as an execu- 
tion unit. It fetches instructions from 
the instruction cache and processes 
branches out of the instruction 
stream. It dispatches instructions to 
the appropriate execution units. Of 
the multiple forms of branch address- 
ing employed by the PowerPC Archi- 
tecture, only one has a direct conse- 
quence for the compiler. For 
performance reasons, it is preferable 
to use the Count register, rather than 
the Link register, for the generation 
of register-indirect branches, includ- 
ing, for instance, the implementation 
of a jump table or the invocation of 
a subroutine through a memory 
pointer. 

Most implementations of the
PowerPC Architecture implement
some form of branch prediction in
order to minimize potential delays
associated with control flow
instructions. The architecture defines an
encoding of one bit in branch
instructions to provide static (i.e.,
compiler-driven) prediction hints to the
microprocessor. If that bit is not set, then
backward branches are predicted
taken, and forward branches are
predicted not taken. Setting this bit in
the encoding of the branch
instruction will reverse the sense of the
prediction. Some microprocessors
implement dynamic prediction methods in
which the static prediction bit may
not be needed by the microprocessor.
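The static rule just described reduces to a one-line decision. The function name and its address-based parameters are assumptions for this sketch; it models only the rule stated in the text and says nothing about any dynamic predictor.

```python
def predicted_taken(branch_address, target_address, hint_bit):
    """Static prediction: backward branches taken, forward not taken.

    A set hint bit reverses the default sense of the prediction.
    """
    backward = target_address < branch_address
    return backward != bool(hint_bit)

# A loop-closing branch (backward) with and without the hint bit set.
print(predicted_taken(0x100, 0x80, 0), predicted_taken(0x100, 0x80, 1))
```

A compiler that expects a forward branch to be taken, for example a rarely skipped block, would set the bit for that branch.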

Memory model. The PowerPC Ar- 
chitecture allows for a weakly or- 
dered storage model. This gives mi- 
croprocessors freedom to rearrange 
memory accesses relative to the order 
specified in the program stream. The 
processor must ensure data depen- 
dencies are not broken in order to 
preserve program semantics. The 
PowerPC Architecture offers user- 
mode cache control instructions for 
both uniprocessor and multiproces- 
sor configurations. These instructions 
allow the programmer or compiler to 
increase performance by clearing, 
storing or preloading cache lines with 
data that may be needed soon in the 
instruction stream. 

Whereas the POWER architecture
uses a strictly big-endian data access
model, the PowerPC Architecture was
extended to support both little- and
big-endian modes. This mode can be
dynamically switched and specifies
that the little-endian transformation
applies to both instructions and data.
The architects provided full support 
for the PowerPC Architecture in mul- 
tiprocessing environments. The 
memory model was enhanced in such 
areas as atomicity, coherence and ali- 
asing in order to cover all aspects of 
multiprocessing environments. 
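The difference between the two byte orders, which the compiler must respect when laying out data for a given target environment, can be seen with the standard `struct` module. The value used is arbitrary and chosen only so that each byte is recognizable.

```python
import struct

value = 0x0A0B0C0D

# Big-endian: most significant byte at the lowest address.
big = struct.pack(">I", value)

# Little-endian: least significant byte at the lowest address.
little = struct.pack("<I", value)

print(big.hex(), little.hex())

# Reading big-endian bytes with the wrong byte order yields a different number.
misread = struct.unpack("<I", big)[0]
print(hex(misread))
```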

PowerPC Microarchitectural 
Differences 

Each implementation of the PowerPC
Architecture has its own
microarchitecture description. The
microarchitecture defines the execution units,
the pipelines, the bus, and the
memory model of a microprocessor.
Information about these aspects of a
chip is not needed by the compiler
to produce correct code, but can
impact performance significantly. The
Motorola compilers have command-line
flags that will target each
implementation in order to achieve the best
performance for each member of the
family.

Superscalar issue. One major area of
difference between implementations
of the same architecture revolves
around the instruction issue model.
Each microarchitecture has different
models for instruction issue as well as 
different execution units and instruc- 
tion pipelines. This impacts how the 
compiler should schedule sequences 
of code. The scheduler uses the im- 
plementation-specific information 
from the machine descriptor tables to 
determine how each instruction will 
flow through the machine. It then 
tries to reorder instructions to avoid 
machine resource conflicts and to 
provide the result from one instruc- 
tion by the time it is needed for any 
data-dependent instructions in the 
program. This is strictly a perfor- 
mance enhancement, and does not 
affect code compatibility. Figure 4 il- 
lustrates the difference in code sched- 
ules for two of the PowerPC imple- 
mentations, the 601 and 603 
microprocessors. 

This example is a short sequence
that multiplies two floating-point
values together in a loop body. The
scheduler interleaves the loads and
integer instructions for the 603
because the 603 has a load/store unit
that is separate from the integer unit,
and it can therefore issue the integer
and load instructions together in one
clock cycle. Load instructions flow
through the integer unit of the 601,
and cannot be issued with other
integer instructions. For this reason, the
scheduler places the loads first in the
sequence for the 601 so the results of
the load will be available as soon as
possible for the multiply instruction.
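The effect of the unit assignment on instruction pairing can be illustrated with a toy issue model. It is a deliberate simplification and every name in it is an assumption: `issue_cycles`, the two unit tables, and the issue width of two are illustrative, and the model ignores data dependences, latencies, and the floating-point and branch units. It captures only the point made above, namely that a load and an integer instruction can share a cycle when they use different units.

```python
def issue_cycles(sequence, unit_of, width=2):
    """Count cycles for in-order issue with at most one instruction per unit.

    Each cycle takes consecutive instructions until `width` is reached or
    the next instruction needs a unit already claimed in that cycle.
    """
    cycles, i = 0, 0
    while i < len(sequence):
        used, issued = set(), 0
        while (i < len(sequence) and issued < width
               and unit_of[sequence[i]] not in used):
            used.add(unit_of[sequence[i]])
            i += 1
            issued += 1
        cycles += 1
    return cycles

# Loads share the integer unit (as described for the 601) ...
UNITS_601_LIKE = {"lfs": "IU", "addi": "IU", "cmpi": "IU"}
# ... or have a load/store unit of their own (as described for the 603).
UNITS_603_LIKE = {"lfs": "LSU", "addi": "IU", "cmpi": "IU"}

loads_first = ["lfs", "lfs", "addi", "cmpi"]
interleaved = ["lfs", "addi", "lfs", "cmpi"]

for name, table in (("601-like", UNITS_601_LIKE), ("603-like", UNITS_603_LIKE)):
    print(name, issue_cycles(loads_first, table), issue_cycles(interleaved, table))
```

In this model the order makes no difference when loads occupy the integer unit, so the loads can go first to start their latency early, while with a separate load/store unit the interleaved order issues in fewer cycles.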

Memory hierarchy. The complexity
of the load/store units, primary and
secondary caches, and main memory
has a strong effect on the
performance of any microprocessor. It is
important for the compiler to
arrange memory operations carefully in
order to maximize overall
performance. There are several areas in
which the compiler can make
substantial improvements in memory
performance.

Scientific code is likely to see the 
biggest effects of cache optimizations 
by the compiler. The abundance of 
array references in typical scientific 
code presents patterns of memory 
accesses which the compiler can
analyze. The compiler can then perform
such optimizations as loop
interchanging and cache blocking. These
types of transformations can change 
the patterns of memory accesses to 
better utilize the data cache. The 
transformations applied by the com- 
piler to the code depend on the size 
and organization of the caches, which 
vary across implementations. Some- 
times, just skewing the starting ad- 
dress of large data objects in memory 
can make a drastic difference in cache 
behavior and overall performance. 

One other method the compiler 
can use to help improve cache perfor- 
mance is data prefetching [6]. The 
PowerPC Architecture provides a 
data cache block touch instruction 
that will tell the processor to try to 
load the data at the specified address 
into the data cache [5]. It will not 
cause a system error handler to be 
invoked, and is therefore safe to issue 
even if that data turns out not to be 
used. On loop-based code, the touch 
instruction can be used to eliminate 
cache miss latencies by overlapping 
the load time with other computa- 
tions before the data is actually ac- 
cessed. A data cache block touch 
for store instruction is also avail- 
able which can take exclusive owner- 
ship of a block in a multiprocessing 
environment. 
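
As a rough sketch of how loop-based prefetching might look at the source level, the following C++ function sums an array while requesting data a fixed distance ahead of the current element. It assumes a GCC- or Clang-compatible compiler, whose __builtin_prefetch intrinsic can be lowered to a touch-style instruction on targets that provide one. The function name and the distance of 16 elements are illustrative choices, not values taken from the compilers described here.

```cpp
#include <cstddef>

// Hypothetical prefetch distance, in elements. A real compiler would derive
// it from the cache-miss latency and the work done per iteration.
constexpr std::size_t kPrefetchDistance = 16;

// Sum an array while hinting that data further ahead will be needed soon.
double sum_with_prefetch(const double* data, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        // The prefetch is only a hint. It is guarded here so that the
        // address passed to it always lies inside the array.
        if (i + kPrefetchDistance < n) {
            __builtin_prefetch(&data[i + kPrefetchDistance], 0, 3);
        }
#endif
        total += data[i];  // the actual work overlaps the outstanding fetch
    }
    return total;
}
```

The result is identical with or without the hint; only the timing of cache fills can change.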

For highly superscalar processors, 
control flow instructions can quickly 
become the bottleneck in instruction 
issue and throughput. There are sev- 
eral areas in which the compiler can 
help reduce this impact. One area is 
by performing transformations such 
as loop unrolling. This reduces the 
number of branches taken during the 
execution of a loop, and also makes 
the code from several iterations of the 
loop available to the compiler to 
schedule. This provides opportuni- 
ties for other types of optimizations 
that might not exist with just one iter- 
ation of the code. 
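
The effect of unrolling can be illustrated by hand. The sketch below is not compiler output: it shows the same loop written once per element and once unrolled by four, with hypothetical function names. The unrolled form executes one loop-closing branch per four elements and exposes four independent multiplies to the scheduler.

```cpp
#include <cstddef>

// Rolled form: one loop-closing branch per element.
void scale_rolled(double* v, std::size_t n, double k) {
    for (std::size_t i = 0; i < n; ++i) {
        v[i] *= k;
    }
}

// Unrolled by four: one loop-closing branch per four elements.
void scale_unrolled4(double* v, std::size_t n, double k) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        v[i]     *= k;
        v[i + 1] *= k;
        v[i + 2] *= k;
        v[i + 3] *= k;
    }
    // Remainder loop handles the last n % 4 elements.
    for (; i < n; ++i) {
        v[i] *= k;
    }
}
```

The price is larger code, which is why such transformations are usually controlled by an optimization option.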

The compiler also provides predic- 
tion hints to processors about 
whether branches are likely to be 
taken or not taken. The 601 and 603 
microprocessors look at the predic- 
tion bit of the branch while the in- 
struction is loaded into the instruc- 
tion queue but before it is executed, 
in order to load the instructions 
down the predicted path into the in- 
struction queue. In the event that the 


branch is not resolved by the time the 
branch is executed, the instructions 
in the predicted path are specula- 
tively issued. If the prediction is 
wrong, then those speculative in- 
structions are cleared, and the in- 
structions from the other path are 
loaded and issued. This backup can 
take several clock cycles, impairing 
performance. Branches that didn’t 
need to be predicted because they 
were resolved in time and branches 
that are correctly predicted are desir- 
able because they don’t introduce 
bubbles or wasted clock cycles in the 
pipelines of the microprocessors. The 
compiler tries to schedule branches 
so the condition on which it depends 
will be determined before the branch 
is executed. In those cases where 
there are not enough instructions to 
schedule the condition and the 
branch apart, the compiler uses sev- 
eral branch prediction heuristics to 
set the prediction bit of the branch. 

Another method to lower the im- 
pact of branches in compiled code is 
to eliminate them altogether. The 
PowerPC Architecture has a powerful 
set of bit-field rotation and manipula- 
tion instructions. These instructions 
can sometimes be used in sequences 
of code that would traditionally re- 
quire a branch instruction. For exam- 
ple, obtaining the result of the com- 
parison of a value with zero is 
traditionally performed through use 
of a conditional branch in conjunc- 
tion with two possible assignments, 
depending on the result of the com- 
parison. In the PowerPC Architec- 
ture, such a value can be obtained 
without the use of branches as 
follows: 

    cmpi    crX,rS,0 
    beq     crX,$+8 
    addi    rD,r0,1 

becomes 

    cntlzw  rS,rS 
    addic   rS,rS,-32 
    rlwinm  rD,rS,1,31,31 
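
To see why the three-instruction sequence yields 1 for a nonzero value and 0 for zero, the arithmetic can be modeled step by step in C++. This is only a sketch of the reasoning: the helper names are hypothetical, and the leading-zero helper is written as a loop purely to model what the count-leading-zeros instruction returns.

```cpp
#include <cstdint>

// Models the count-leading-zeros step: returns 32 for zero, else 0..31.
inline std::uint32_t count_leading_zeros(std::uint32_t x) {
    std::uint32_t n = 0;
    for (std::uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0; mask >>= 1) {
        ++n;
    }
    return n;
}

// Models the three-instruction sequence: result is 1 if s != 0, else 0.
inline std::uint32_t is_nonzero_branch_free(std::uint32_t s) {
    std::uint32_t t = count_leading_zeros(s);  // 32 if s == 0, else 0..31
    t -= 32u;        // 0 if s == 0, otherwise wraps so the top bit is set
    return t >> 31;  // rotate left by 1 and keep the low bit: the old sign bit
}
```

Subtracting 32 turns "count equals 32" into zero and every other count into a negative number, so the sign bit alone carries the answer.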


Memory space taken up by code 
and data as well as the run-time 
memory needed for stacks can be- 
come very important issues in some 
environments. The compiler has sev- 
eral options that optimize for code 


and data space. For instance, function 
prologue and epilog code are respon- 
sible for saving and restoring any 
callee saved registers in order to pre- 
serve the original values in the regis- 
ters. One space-saving option is to use 
subroutines that save and restore the 
registers instead of individual instruc- 
tions in the prologue and epilog. This 
replaces several instructions with one 
function call which saves code space. 
The trade-off is that the functions 
have overhead associated with them 
that may cause lower performance. 
Other optimizations that may help 
performance can increase space, such 
as loop unrolling and function inlin- 
ing. These kinds of optimizations can 
be turned off, while keeping other 
optimizations active. 

An optimization to save space in 
the data sections of programs is to 
sort the data and store the data ob- 
jects by size. This can eliminate 
wasted space that might otherwise be 
needed for padding in order to keep 
data objects on proper alignment 
boundaries. The PowerPC Architec- 
ture does not require data references 
to be aligned, but performance may 
be drastically reduced if unaligned 
references are made. A document with 
the working name PowerPC Embedded 
Systems Specification is defining a 
standard for embedded systems that 
puts a much higher emphasis on space 
saving than the PowerOpen ABI does. 
It defines changes in the areas 
mentioned above, along with other 
changes such as a stack definition 
that decreases the amount of memory 
needed, for use in embedded systems. 
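
The padding saved by ordering objects by size is easiest to see inside a single aggregate. The sketch below is an analogy under that assumption: it reorders struct members rather than objects in a data section, the type and member names are hypothetical, and the exact sizes depend on the target ABI.

```cpp
#include <cstddef>

// Members in arbitrary order: small objects force padding before larger ones.
struct DeclarationOrder {
    char   flag;
    double value;
    char   tag;
    int    count;
};

// Same members, largest alignment first: padding collects at the end only.
struct SortedBySize {
    double value;
    int    count;
    char   flag;
    char   tag;
};

// Ordering by descending alignment never needs more space than any other order.
static_assert(sizeof(SortedBySize) <= sizeof(DeclarationOrder),
              "sorting members by size should not increase the footprint");

// Bytes recovered on this particular target.
constexpr std::size_t kBytesSaved = sizeof(DeclarationOrder) - sizeof(SortedBySize);
```

On ABIs where a double is aligned to 8 bytes the saving is typically several bytes per object, which adds up quickly in a memory-constrained system.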

Summary 

Motorola has developed a modular 
approach to generating highly opti- 
mizing PowerPC compilers for a 
broad range of target and host envi- 
ronments. By leveraging a common 
set of intermediate representations 
and optimization routines, the com- 
pilers can be quickly customized for 
individual PowerPC implementa- 
tions, including the 601, 603, 604 and 
620 microprocessors, as well as em- 
bedded control versions of the archi- 
tecture currently under design. On- 
going efforts to integrate the 
compilers with simulators, debuggers 
and programming environments will 
provide additional mechanisms 
through which the modular design of 
the compilers can be exploited to 
reduce time to market for PowerPC 
systems and application developers in all 
segments of the computing industry. 

□ 

References 

1. Anderson, W. An overview of 
Motorola’s PowerPC simulator family. 
Commun. ACM (June 1994). 

2. Diep, T., Shen, J. and Phillip, M. EX- 
PLORER: A retargetable and visualiza- 
tion-based simulator for superscalar 
processors. In Proceedings of the Twenty- 
Sixth International Symposium on Microar- 
chitecture, Austin, Tex., Dec. 1993. 

3. PowerOpen Application Binary Interface. 
Draft for First ed., IBM, Oct. 1993. 

4. PowerPC User Instruction Set Architecture. 
Book I, Version 1.04, May 4, 1993. 

5. PowerPC Virtual Environment Architec- 
ture. Book II, Version 1.04, May 4, 
1993. 

6. Software prefetching. SIGPLAN Not. 
26, 4 (Apr. 1991), 40-52. 


PowerPC, PowerPC Architecture, PowerPC 601, 
PowerPC 603, PowerPC 604, PowerPC 620 are 
registered trademarks of IBM. 


About the Authors: 

JULIE SHIPNES is currently a senior 
design engineer in the RISC software 
group at Motorola in Austin, Texas. Cur- 
rent technical interests include compiler 
optimization techniques, microprocessor 
performance analysis and high-perfor- 
mance graphics applications. 


MIKE PHILLIP is currently the manager 
of the compiler and tools development 
team in the RISC software group at Moto- 
rola in Austin, Texas. Current technical 
interests include parallelizing compilers, 
microprocessor performance analysis and 
microarchitectural simulation. Authors’ 
Present Address: Motorola RISC 
Software Group, Mail Drop OE1 12, 6501 Wm. 
Cannon Drive West, Austin, TX 78735; 
email: {julie, phillip}@oakhill.sps.mot.com 


Permission to copy without fee all or part of this 
material is granted provided that the copies are 
not made or distributed for direct commercial 
advantage, the ACM copyright notice and the 
title of the publication and its date appear, and 
notice is given that copying is by permission of 
the Association for Computing Machinery. To 
copy otherwise, or to republish, requires a fee 
and/or specific permission. 


© ACM 0002-0782/94/0600 $3.50 




William Anderson 



An Overview of Motorola’s PowerPC 
Simulator Family 

The successful introduction of a new computer architecture into the marketplace 
requires that both software and hardware be available simultaneously at the time of 
system introduction. Moreover, there is substantial need for software tool support (e.g., 
compilers and simulators) during the design phase of the microprocessor itself. Such 
phased development necessitates coordination among several groups: 


• Application and operating system 
software developers need to be able 
to compile, execute and debug their 
programs so applications and operat- 
ing systems are available at system 
introduction. 

• System hardware designers need to 
be able to design, simulate and debug 
circuit boards, system ASICs and 
monitor executive software so the sys- 
tem (e.g., a workstation and micro- 
computer) can be integrated as soon 
as possible after the availability of 
working silicon. 

• Microprocessor designers them- 
selves need software tools in order to 
write and debug diagnostic software 
used to debug chip designs and test 
the functionality of the final silicon. 

• Prospective customers (either sys- 
tems vendors or end users) need to 
be able to “kick the tires” of a new 
microprocessor design by compiling 
well-known or application-specific 
benchmark code and running it on a 
simulator in order to get estimates of 
system performance. 


An essential part of each of these 
functions is a software simulator of 
the new microprocessor(s) that solves 
the designer’s needs. Currently, no 
single simulator program can solve 
the diverse needs of these user 
groups. However, any microproces- 
sor simulator should have several 
characteristics that enhance its utility 
to the end user, such as being highly 
configurable (in order that an appro- 
priate configuration of the simulator 
be available to solve the particular 
problems of the end user), extensible 
(so that the end user’s effort in using 
the simulator has maximum effect 
and to avoid unnecessary redesign 
and “creeping featurism” by the sim- 
ulator implementers), and as efficient 
as possible (to allow the end user to 
iterate through as many compile- 
execute-debug cycles as possible in a 
given amount of time and to mini- 
mize the time necessary to run a 
benchmark). 

Motorola and IBM are currently 
developing the PowerPC family of 


RISC microprocessors [7, 8, 9]. To 
support this effort, Motorola’s RISC 
software team is developing new soft- 
ware tools, including state-of-the-art 
compilers and simulators. This article 
describes three members of Moto- 
rola’s PowerPC simulator family. All 
of these simulators are based on a 
common set of object-oriented soft- 
ware technology. 

Features of Motorola’s 
PowerPC Simulators 

All members of Motorola’s PowerPC 
simulator family share some common 
features. This common feature set 
ensures that support code developed 
on one simulator will run appropri- 
ately on another member of the fam- 
ily. It also lessens the burden of learn- 
ing a new set of commands for a new 
simulator. 

Command line interpreter. All of 
Motorola’s PowerPC simulators use 
Tcl [6] as a command line interpreter 
(CLI) language. Tcl (a “tool command 
language” which is pronounced 
“tickle”) is an efficiently parsed 
language created by John Ousterhout 
of the University of California at 
Berkeley. Tcl offers excellent 
facilities for string manipulation 
and program extension. Many of the 
system support utilities are 
implemented in Tcl. 

Tcl allows the user to define 
variables and subroutines, implement 
flow control and in general adapt the 
user interface to the particular needs 
of the user. Tcl allows tedious 
sequences of simulator commands to be 
replaced by parameterized command 
shortcuts. Also, the simulators try to 
make as few policy decisions as 
possible. Instead, these policy routines 
may be implemented as Tcl 
subroutines. Then, these routines are 
invoked as callback routines from the 
simulator command interpreter. In 
this way, the end user may implement 
arbitrarily complex task handler 
routines for managing simulator 
events or error conditions. 

A distinct advantage to Tcl as a CLI 
language is the availability of the Tk 
extension to Tcl. Tk is a widget 
library that works with the X Window 
System in order to facilitate rapid 
development of graphic user 
interfaces (GUIs). Motorola’s PowerPC 
architectural and timing simulators 
will be extended with user-extendable 
GUIs based on the Tk toolkit in 
future versions. 

There is also a very easy-to-use 
alias facility (also implemented in Tcl) 
which allows an end user to use 
one-word abbreviations for longer 
commands, much like the alias facility 
available in some Unix shells. 

Operating system emulation mode. 
All of Motorola’s PowerPC simulators 
were written to simulate the entire 
virtual machine presented by the 
corresponding PowerPC microprocessor. 
This in particular allows the end 
user to boot, debug and analyze the 
performance of an operating system 
kernel. This capability requires that 
an appropriate kernel be compiled 
with the correct device drivers and 
that corresponding external device 
emulation modules have been written 
and dynamically linked into the 
emulator. 

Booting a kernel is an excessive 


amount of work if the goal of simula- 
tion is to evaluate the performance of 
user-level programs. Therefore, the 
Motorola PowerPC simulators have 
the ability to intercept the system call 
trap and to vector the simulator code 
to a predefined set of subroutines 
which emulate the operating system 
trap interface. The simulators come 
with an emulator for AIX (Advanced 
Interactive executive, IBM’s version 
of Unix). However, this OS emulation 
module may be replaced by the end 
user by means of a user-supplied trap 
table and trap emulation code which 
is linked at run time by means of a 
shared object file. 

Support for multiprocessing config- 
urations. The Motorola PowerPC ar- 
chitectural and timing simulator fam- 
ilies provide direct support for 
multiprocessing configurations. In 
the case of the architectural simula- 
tors, a very simple nonpipelined one- 
instruction-per-clock timing model is 
used for the shared memory bus. The 
timing simulators use a clock-phase 
accurate simulation of the shared 
memory bus to achieve support for 
multiprocessing. In both cases, the 
full semantics of the shared memory 
system (in particular, the bus snoop- 
ing required to implement a MESI 
cache protocol) are implemented. 

Object-oriented design in C++. All 
of the simulators are implemented in 
C++ [4, 5] and have an internal 
design which follows the Booch 
methodology [2]. This methodology has 
improved the reuse of code among 
all of the members of the simulator 
family. 

C++ and the Booch methods offer 
several features that facilitate the 
implementation of this set of simula- 
tors. This is especially important 
when one considers the diversity of 
microprocessors modeled and that 
some of these microprocessors offer 
primitives which are unavailable 
using more conventional program- 
ming languages. 

The PowerPC Architecture sup- 
ports both signed and unsigned 64- 
bit integers. However, all of Moto- 
rola’s PowerPC simulators will be 
hosted initially on 32-bit architecture. 
Moreover, conventional languages 
(e.g., C) do not typically offer a 64-bit 
integer type. C++ allowed us to 
define a 64-bit integer class and to 
define operators on that class that make 
using 64-bit integers as simple (from 
a syntactical point of view) as the 
usual 32-bit integers. 
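
One possible shape for such a class is sketched below. The class name UInt64, its two-word layout and the small operator set are assumptions made for illustration; the article does not show the actual class. Only unsigned addition and comparison are modeled.

```cpp
#include <cstdint>

// Hypothetical 64-bit unsigned integer built from two 32-bit words, as might
// be used on a host whose compiler offers no native 64-bit type.
class UInt64 {
public:
    UInt64(std::uint32_t hi, std::uint32_t lo) : hi_(hi), lo_(lo) {}
    explicit UInt64(std::uint32_t lo) : hi_(0), lo_(lo) {}

    std::uint32_t hi() const { return hi_; }
    std::uint32_t lo() const { return lo_; }

    friend UInt64 operator+(const UInt64& a, const UInt64& b) {
        std::uint32_t lo = a.lo_ + b.lo_;              // wraps modulo 2^32
        std::uint32_t carry = (lo < a.lo_) ? 1u : 0u;  // wrapped, so carry out
        return UInt64(a.hi_ + b.hi_ + carry, lo);
    }
    friend bool operator==(const UInt64& a, const UInt64& b) {
        return a.hi_ == b.hi_ && a.lo_ == b.lo_;
    }
    friend bool operator<(const UInt64& a, const UInt64& b) {
        return a.hi_ < b.hi_ || (a.hi_ == b.hi_ && a.lo_ < b.lo_);
    }

private:
    std::uint32_t hi_;  // upper 32 bits
    std::uint32_t lo_;  // lower 32 bits
};
```

With the operators in place, an expression such as a + b reads the same whether the operands are native 32-bit integers or UInt64 objects.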

The ability to define an abstract 
register class has allowed us to largely 
ignore the differences between 32-bit 
and 64-bit instructions in the source 
code of the instruction handlers. 
Where the instruction handlers do 
differ in semantics from implemen- 
tation to implementation, they are 
declared as virtual functions. The 
customization of such functions can 
then be incrementally added and 
maintained. 
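
The division of labor between shared and implementation-specific handlers might be sketched as follows. All names here (CpuModel, Model32, Model64, wrap) are hypothetical, and a native 64-bit host type is used only to keep the example short.

```cpp
#include <cstdint>

// Hypothetical base model: handlers common to all implementations live here.
class CpuModel {
public:
    virtual ~CpuModel() {}

    void setGpr(int r, std::uint64_t v) { gpr_[r] = wrap(v); }
    std::uint64_t getGpr(int r) const { return gpr_[r]; }

    // Shared handler: written once, independent of the register width.
    void add(int rd, int ra, int rb) { setGpr(rd, gpr_[ra] + gpr_[rb]); }

protected:
    // Width-dependent behavior is isolated in one virtual function.
    virtual std::uint64_t wrap(std::uint64_t v) const = 0;

private:
    std::uint64_t gpr_[32] = {};
};

// A 32-bit implementation keeps only the low 32 bits of every result.
class Model32 : public CpuModel {
protected:
    std::uint64_t wrap(std::uint64_t v) const override { return v & 0xFFFFFFFFull; }
};

// A 64-bit implementation keeps the full result.
class Model64 : public CpuModel {
protected:
    std::uint64_t wrap(std::uint64_t v) const override { return v; }
};
```

Adding a new implementation then means overriding only the functions whose semantics actually differ.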

The ability to derive support 
classes for the internal interfaces used 
in the timing simulators allowed spe- 
cialized classes (representing micro- 
processor functional units) to take 
advantage of the interfaces (and con- 
structors) of the ancestor classes. This 
feature of C++ simplified the 
construction of a simulated 
microprocessor and avoided many of the 
programming problems associated with 
command parsing and distribution. 

The PowerPC Architectural 
Simulator 

The PowerPC Architectural Simula- 
tor (PPCArch) family is a set of pro- 
grams that emulate various PowerPC 
microprocessors (see Figure 1). Cur- 
rently, the PowerPC 603 and 604 
microprocessors are simulated by dif- 
ferent versions of PPCArch, and 
other models will be simulated in the 
future. These are instruction set ar- 
chitecture (ISA) simulators. This 
means that no modeling of timing 
effect is done by the simulators and 
each instruction simulated is assumed 
to complete before the next instruc- 
tion starts: there is no simulation of a 
pipeline in PPCArch. The goal is to 
run PowerPC object code as accu- 
rately and as quickly as possible. 

The PPCArch simulators are based 
on a type of simulator originally de- 
veloped for the Motorola M88K 
RISC microprocessor by Robert 
Bedichek for Tektronix [1]. Bedichek 
developed a style of threaded code 
simulator that used a unique C lan- 
guage and assembly-code macro 
function to emulate each instruction 
in the 88100. He was able to decode 
an instruction once and use the 
decoded form many times, depending 
on locality of code reference and size 
of simulated instruction cache. He 
was also able to simulate the 88K 
virtual machine sufficiently to boot Unix 
on the simulator. The performance 
of this simulator was also very 
impressive: an average of 130,000 
instructions per second when hosted 
on a 2.5MIPS 68020 Tektronix 
workstation. 

Figure 1. The high-level components of PPCArch 

Figure 2. Basic components of PPCSim603 

[Diagram labels recovered from the figures: Tcl-based command line interpreter; Tcl support routines; PowerPC 603 (FPU, FXU, fetch/branch, dispatch/completion, load/store, ICache, DCache, IMMU, DMMU, FPR, GPR, BIU); motherboard (real memory, bus devices)] 
Like Bedichek’s simulator, each 
instruction has its own simulation 
function (the instruction handler) in 
PPCArch. However, unlike Bedichek 
we implement our handlers in port- 
able C++ as opposed to assembler 
macros or nonportable language fea- 
tures. We are willing to trade execu- 
tion speed for maintainability and 


portability as PPCArch will be hosted 
on multiple architectures. We also 
cache our decoded instructions, but 
we use an abstract cache of arbitrary 
size as opposed to using a model of 
the actual instruction cache. 
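
A decode-once scheme of this kind might be sketched as below. The types, the toy field layout and the single add-immediate handler are all assumptions for illustration; the real PPCArch data structures are not shown in the article.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical machine state: just the general-purpose registers.
struct ToyMachine {
    std::uint32_t gpr[32] = {};
};

struct Decoded;
typedef void (*Handler)(ToyMachine&, const Decoded&);

// Decoded form: the handler to call plus pre-extracted operand fields.
struct Decoded {
    Handler handler;
    int rd;
    int ra;
    std::int32_t imm;
};

// Toy handler for an "add immediate" style instruction.
inline void handleAddImm(ToyMachine& m, const Decoded& d) {
    m.gpr[d.rd] = m.gpr[d.ra] + static_cast<std::uint32_t>(d.imm);
}

// Abstract cache of decoded instructions, keyed by instruction address.
class DecodeCache {
public:
    const Decoded& lookup(std::uint32_t addr, std::uint32_t word) {
        auto it = cache_.find(addr);
        if (it == cache_.end()) {
            ++decodes_;  // decode cost is paid only on the first visit
            it = cache_.emplace(addr, decode(word)).first;
        }
        return it->second;
    }
    unsigned decodes() const { return decodes_; }

private:
    // Toy decoder: the opcode field is ignored and every word is treated
    // as add-immediate with a 16-bit signed immediate.
    static Decoded decode(std::uint32_t word) {
        Decoded d;
        d.handler = handleAddImm;
        d.rd = static_cast<int>((word >> 21) & 31u);
        d.ra = static_cast<int>((word >> 16) & 31u);
        d.imm = static_cast<std::int16_t>(word & 0xFFFFu);
        return d;
    }

    std::unordered_map<std::uint32_t, Decoded> cache_;
    unsigned decodes_ = 0;
};
```

Because the map is not tied to the geometry of a modeled instruction cache, its size can be chosen for host performance alone. A complete simulator would also have to invalidate entries when simulated code is overwritten, which this sketch omits.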

The basic design of the physical 
memory system in PPCArch was also 
inspired by Bedichek’s design. We 
have a physical memory manager 
which allocates physical memory 
pages by an on-demand (i.e., “lazy 
allocation”) scheme. The user is able 
to allocate a physical memory system 
of a fixed size without paying the po- 
tentially high start-up costs of allocat- 
ing large chunks of host heap mem- 
ory. Instead, the physical pages are 
allocated as they are touched. 
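
The lazy allocation idea can be sketched with a map from page number to page storage. The class name, the 4KB page size and the choice to return zero for pages that have never been written are assumptions of this sketch, not details from PPCArch.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <unordered_map>
#include <vector>

// Hypothetical page size for the simulated physical memory.
constexpr std::uint32_t kPageSize = 4096;

class LazyMemory {
public:
    // Declaring a large memory costs nothing until pages are written.
    explicit LazyMemory(std::uint32_t size) : size_(size) {}

    std::uint8_t read(std::uint32_t addr) const {
        check(addr);
        auto it = pages_.find(addr / kPageSize);
        // Pages never written read as zero without being allocated.
        return it == pages_.end() ? 0 : it->second[addr % kPageSize];
    }

    void write(std::uint32_t addr, std::uint8_t value) {
        check(addr);
        std::vector<std::uint8_t>& page = pages_[addr / kPageSize];
        if (page.empty()) {
            page.assign(kPageSize, 0);  // first touch: allocate host storage
        }
        page[addr % kPageSize] = value;
    }

    std::size_t allocatedPages() const { return pages_.size(); }

private:
    void check(std::uint32_t addr) const {
        if (addr >= size_) throw std::out_of_range("address beyond simulated memory");
    }

    std::uint32_t size_;
    std::unordered_map<std::uint32_t, std::vector<std::uint8_t>> pages_;
};
```

A 64MB simulated memory that a program touches in only a few places therefore consumes only a few host pages.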

An interrupt and bus simulation 
module provides the interface be- 
tween the physical memory manager 
and the caches. The interrupt/bus 
manager has the ability to attach at 
run time an arbitrary number of 
memory-mapped I/O device simula- 


tor modules to specified ranges of 
physical address. Whenever such an 
address range is read from or written 
to, the appropriate memory-mapped 
handler is invoked with the address 
of the access, the type of access (read, 
write or execute) and the data (if ap- 
propriate) as arguments. By way of a 
simple deferred interrupt manager 
and an interface to the real memory 
system, the memory-mapped handler 
can simulate the actions of a memory- 
mapped DMA device without know- 
ing the details of how the simulator 
works. In this way, device drivers 
(that part of an OS kernel which di- 
rectly manages hardware devices) can 
be tested by writing simple memory- 
mapped handlers to simulate the as- 
sociated devices. 
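
The attachment of memory-mapped handlers to address ranges might look like the following sketch. The names IoBus, Access and MmioHandler are hypothetical, and deferred interrupts and DMA are left out.

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

enum class Access { Read, Write, Execute };

// Handler arguments: address of the access, its type, and the data word
// (input for writes, output for reads).
using MmioHandler =
    std::function<void(std::uint32_t addr, Access type, std::uint32_t& data)>;

class IoBus {
public:
    // Attach a device simulator to [base, base + length) at run time.
    void attach(std::uint32_t base, std::uint32_t length, MmioHandler handler) {
        regions_.push_back(Region{base, length, std::move(handler)});
    }

    // Returns true if a device claimed the access; otherwise the caller
    // would fall through to ordinary simulated memory.
    bool access(std::uint32_t addr, Access type, std::uint32_t& data) {
        for (Region& r : regions_) {
            // Unsigned subtraction wraps for addr < base, so one compare suffices.
            if (addr - r.base < r.length) {
                r.handler(addr, type, data);
                return true;
            }
        }
        return false;
    }

private:
    struct Region {
        std::uint32_t base;
        std::uint32_t length;
        MmioHandler handler;
    };
    std::vector<Region> regions_;
};
```

A device model is then just a function, or an object with captured state, that reacts to reads and writes in its range; it needs no knowledge of how the rest of the simulator works.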

The interrupt/bus module also sup- 
ports symmetric multiprocessing sim- 
ulation configurations. In this config- 
uration, cpu, effective address trans- 
lation and cache modules are 
instantiated as a unit and associated 
with the interrupt/bus module. The 
interrupt/bus module provides sup- 
port for snooping between the caches 
and physical memory. This config- 
uration will allow OS developers 
to debug multithreaded operating 
systems. 

The effective address translation 
module, similar to the physical mem- 
ory module, has a lazy allocation 
scheme for its structure maintenance. 
However, the effective address trans- 
lation module’s primary function is to 
maintain and to accelerate the effective 
→ virtual → physical translation 
and to support the semantics of the 
PowerPC effective memory system. 
Both memory systems use sparse in- 
verted tree structures to implement 
the address searches. 

Memory-mapped handlers (similar 
to the instruction handlers and the 
I/O simulation handlers) are used 
to implement traps and memory- 
resident breakpoints. The breakpoint 
handlers are set using a user-defined 
code which is returned to the CLI 
when the user hits the breakpoint. 
The user is then free to perform 
whatever manipulations are appro- 
priate to the breakpoint code 
returned. 

The full virtual machine model is 
primarily implemented with the in- 
struction handlers and with special 




handler functions that are associated 
with the supervisor visible control 
registers. This allows us to treat 
control registers as uniformly as possible 
while allowing the diversity of 
function associated with them. 

AIX emulation is accomplished by 
using a monolithic memory model 
and a special trap handler which is 
associated with the system call trap 
vector. When a system call trap is 
encountered, control is transferred to 
the system call trap handler. This 
handler then examines the appropriate 
general-purpose registers and 
emulates the system call using the 
host operating system. The necessary 
stack space for the user program is 
mapped separately using CLI 
routines. Although not all of the AIX 
system call interface is provided, 
provision is made for this set of calls to be 
extended by the user. 
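
An extensible trap table of this kind might be sketched as follows. The register convention (call number in register 0, arguments and result starting at register 3) is an assumption chosen for illustration and is not claimed to be the AIX convention; the class and handler names are likewise hypothetical.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

struct SyscallRegisters {
    std::uint32_t gpr[32] = {};
};

// Illustrative register convention (an assumption, not the AIX ABI).
constexpr int kSyscallNumberReg = 0;
constexpr int kFirstArgReg = 3;
constexpr int kResultReg = 3;

class SyscallEmulator {
public:
    using Handler = std::function<void(SyscallRegisters&)>;

    // The user may install or replace the handler for any call number.
    void install(std::uint32_t number, Handler handler) {
        table_[number] = std::move(handler);
    }

    // Invoked when the simulated program executes a system call.
    // Returns false if no handler exists for the requested call.
    bool trap(SyscallRegisters& regs) const {
        auto it = table_.find(regs.gpr[kSyscallNumberReg]);
        if (it == table_.end()) return false;
        it->second(regs);
        return true;
    }

private:
    std::map<std::uint32_t, Handler> table_;
};

// Toy handler standing in for a real service: returns the sum of its two
// arguments. A real handler would forward the request to the host OS.
inline void toyAddHandler(SyscallRegisters& r) {
    r.gpr[kResultReg] = r.gpr[kFirstArgReg] + r.gpr[kFirstArgReg + 1];
}
```

Unhandled call numbers are reported to the caller rather than silently ignored, which leaves the policy for missing calls with the user.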

The code for PPCArch (which 
supports both the 603 and 604 PowerPC 
microprocessor models) comprises 
approximately 49,000 lines of 
noncomment, nonblank C++ code. 


The PowerPC 603 Timing 
Simulator 

The PowerPC 603 Timing Simulator 
(PPCSim603) is the first of a series of 
simulators that accurately model both 
the instruction set semantics and the 
detailed pipeline behavior of a 
PowerPC microprocessor implementation. 
The basic design emphasis is 
similar to the design of the PPCArch 
simulators, although the organization 
of the simulator internal to the 
CPU module is significantly different 
from the architectural simulators. In 
particular, the organization of 
PPCSim603 is based on the 
microarchitectural features of the actual 
PowerPC 603 microprocessor, as 
opposed to the architecture of the 
PowerPC instruction set as a whole. 
Therefore, the structure of 
PPCSim603 is organized around 
functional units and the synchronized 
communication among functional 
units. 

Internal to each simulated 
PowerPC 603 microprocessor in 
PPCSim603 are functional units 
which correspond to the actual 
functional blocks in a physical PowerPC 
603 microprocessor. Each of these 
functional units is implemented as a 
C++ class and has a specific 
procedural interface appropriate to its 
functional type. This procedural 
interface comprises its public member 
functions. For example, the bus 
interface unit (BIU) will have functional 
interfaces which allow it to 
communicate directly with the caches and with 
the external bus. A unit also inherits 
functional interfaces from the classes 
from which it is derived. This use of 
C++ inheritance greatly facilitates 
the implementation of clock and 
command distribution networks. 

Synchronization of the functional 
units is accomplished by use of a 
simulated two-phase clock. A functional 
unit may make no assumptions 
regarding its order of invocation with 
respect to any other functional unit 
within a clock phase. The clock is 
typically toggled by the user from a 
command to the bus. The bus distributes 
the clock signal to all of its bus 
devices, which include the memory 
system and any PowerPC 603 
microprocessors attached to it. Within the 
simulated PowerPC 603, the 
clock signal may be multiplied and 
then redistributed from the BIU to 
all remaining clocked simulator 
functional units. The code within a 
functional unit (for instance, the code 
that might simulate a floating-point 
pipeline) is responsible for its 
own synchronization. 
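
The reason a two-phase clock makes invocation order irrelevant can be shown with a small sketch. Two units that read each other's outputs sample their inputs in the first phase and publish new outputs in the second, so neither can observe a half-updated neighbor. The class names are hypothetical, and clock multiplication inside the processor is not modeled.

```cpp
#include <vector>

enum class Phase { One, Two };

// Interface inherited by every clocked unit.
class Clocked {
public:
    virtual ~Clocked() {}
    virtual void clock(Phase p) = 0;
};

// Toy unit: copies the output of another unit once per cycle.
class Latch : public Clocked {
public:
    explicit Latch(int initial) : out_(initial), next_(initial), source_(nullptr) {}
    void connect(const Latch* source) { source_ = source; }
    int output() const { return out_; }

    void clock(Phase p) override {
        if (p == Phase::One) {
            if (source_) next_ = source_->output();  // sample inputs only
        } else {
            out_ = next_;                            // publish outputs only
        }
    }

private:
    int out_;
    int next_;
    const Latch* source_;
};

// Distributes each phase to every attached device, in attachment order.
class ClockBus {
public:
    void attach(Clocked* device) { devices_.push_back(device); }
    void cycle() {
        for (Clocked* d : devices_) d->clock(Phase::One);
        for (Clocked* d : devices_) d->clock(Phase::Two);
    }

private:
    std::vector<Clocked*> devices_;
};
```

Two cross-connected latches exchange their values on every cycle regardless of which was attached first, which is the property the functional units rely on.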

A functional unit may receive 
commands directly from the command 
line. The basic syntax of this 
command is: verb unit arg . . . 

where verb is either display or 
modify, unit is the pathname to the 
functional unit of interest (for 
instance, motherboard.cpu3.fpu) 
and arg is an arbitrary list of 
arguments which is passed into the 
functional unit for further 
interpretation. This simple control command 
syntax greatly simplifies command 
parsing. The dispatch network that 
delivers these commands 
is managed transparently to 
the functional unit implementer. To 
be able to receive commands, a 
functional unit is simply derived from the 
[...] C++ class, and its virtual 
functions display and modify 
are defined to suit the particular 
functional unit. 
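
A command dispatch scheme of this shape might be sketched as follows. The base class name CommandTarget, the ValueUnit example unit and the error strings are hypothetical; only the verb unit arg form and the display and modify verbs come from the text.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical base class: anything that can receive commands derives from it.
class CommandTarget {
public:
    virtual ~CommandTarget() {}
    virtual std::string display(const std::vector<std::string>& args) = 0;
    virtual std::string modify(const std::vector<std::string>& args) = 0;

    void addChild(const std::string& name, CommandTarget* child) {
        children_[name] = child;
    }

    // Walks a dotted pathname such as "motherboard.cpu3.fpu" below this node.
    CommandTarget* find(const std::string& path) {
        if (path.empty()) return this;
        std::string::size_type dot = path.find('.');
        auto it = children_.find(path.substr(0, dot));
        if (it == children_.end()) return nullptr;
        std::string rest = (dot == std::string::npos) ? std::string()
                                                      : path.substr(dot + 1);
        return it->second->find(rest);
    }

private:
    std::map<std::string, CommandTarget*> children_;
};

// Toy unit holding a single string value.
class ValueUnit : public CommandTarget {
public:
    std::string display(const std::vector<std::string>&) override { return value_; }
    std::string modify(const std::vector<std::string>& args) override {
        if (!args.empty()) value_ = args[0];
        return value_;
    }

private:
    std::string value_ = "0";
};

// Parses "verb unit arg ..." and routes the command to the named unit.
std::string dispatch(CommandTarget& root, const std::string& line) {
    std::istringstream in(line);
    std::string verb, unit, arg;
    in >> verb >> unit;
    std::vector<std::string> args;
    while (in >> arg) args.push_back(arg);

    CommandTarget* target = root.find(unit);
    if (!target) return "error: no such unit";
    if (verb == "display") return target->display(args);
    if (verb == "modify") return target->modify(args);
    return "error: unknown verb";
}
```

The unit itself never parses the pathname; it sees only the argument list, so new units need no changes to the dispatch code.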

PPCSim603 has an event mechanism 
which allows the end user to 
instrument a simulator configuration. 
Events occur when the simulator 
changes state. An end user can easily 
write his or her own program to 
analyze the events that are generated 
during a simulation run. Events are 
generated by predefined code 
internal to the functional unit modules 
within PPCSim603. Each functional 
unit can individually enable or 
disable the generation of events. There is 
also a global event enable switch 
within the CLI. Examples of event 
types include cache hits and misses, 
bus cycles, instruction pipeline state 
changes and so on. 

The event’s value is a character 
string, with colon-separated 
name=value pairs. For instance, a 
hypothetical instruction event may have the 
value: 

evtype=inst:opcode=0x48000000:unit=fxu:addr=0xf000:stage=finish:clock=10 

In this example, there are six 
name=value pairs in the event. In 
order, they indicate that the event 
type is an instruction event, the 
hexadecimal opcode is 0x48000000, 
the event takes place in the 
fixed-point unit, the address of the 
instruction is 0xf000, the stage of the 
instruction pipeline indicates that the 
instruction is finished and that the 
clock count is 10. This form of event 
is particularly simple to parse and 
evaluate in Tcl code. 
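
The event format is equally easy to take apart in other languages. The C++ sketch below only illustrates the string format; the simulator's own handlers are script procedures in the CLI language, and the function name parseEvent is hypothetical.

```cpp
#include <map>
#include <sstream>
#include <string>

// Splits an event string of colon-separated name=value pairs into a map.
// Pairs without an '=' are skipped.
std::map<std::string, std::string> parseEvent(const std::string& event) {
    std::map<std::string, std::string> fields;
    std::istringstream in(event);
    std::string pair;
    while (std::getline(in, pair, ':')) {
        std::string::size_type eq = pair.find('=');
        if (eq == std::string::npos) continue;
        fields[pair.substr(0, eq)] = pair.substr(eq + 1);
    }
    return fields;
}
```

Applied to an instruction event, the resulting map can be queried by field name, for example to select only events from one functional unit or one pipeline stage.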

When an enabled event occurs, a 
global Tcl procedure called an event 
handler is called with the value of the 
event. Within this routine, a user may 
filter events, create histograms, 
gather and process instruction 
statistics and so forth. The default event 
handler merely echoes the value of 
the event string to the command line. 

Events have proved to be ex- 
tremely useful during the develop- 
ment of PPCSim603. Events have 
been used for internal debugging. 
Instruction trace formatters have 
been written using the instruction 
event as a generation mechanism. 
Finally, events have been used to ver- 
ify the timing accuracy of 
PPCSim603. In this last application, a 
stream of instruction events is parsed 
and an instruction timing histogram 
(using Tcl’s associative arrays) is 
constructed during the simulation of a 
program. This histogram is then 
compared to a histogram generated 
from an instrumented version of the 
PowerPC 603 hardware models, and 
the variances are analyzed to detect 
timing inaccuracies. This verification 
of timing accuracy would be much 
more difficult to implement without 
the event mechanism. 

Interactions between the PPCSim- 
603 development and test teams 
made it clear that error handling, 
similar to event handling, is a policy 
decision best left to the end user. Al- 
though an error could be considered 
to be just another kind of event, it was 


decided that having a separate 
handler for errors made the most sense 
for a casual simulator user. 
Therefore, an error handler routine 
(analogous to the event handler) is called 
when an error condition arises dur- 
ing a simulation run. Error state in- 
formation is held in two global Tel 
variables (errorCode and er- 
rorlnfo), which can be interrogated 
from within the body of the error 
handler routine. Errors fall into two 
general categories: 

• an error condition may be related 
to an inappropriate request to the 
underlying operating system (for 
example, the user may try to open a 
nonexistent file), or 

• a simulator command (e.g. display 
or modify) is in error. 

In either case, the error is analyzed
and additional information is
presented in errorCode and errorInfo
for use by the user’s error handler.
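
The idea of leaving error policy to the user can be sketched in C++ as a replaceable callback. Everything in this sketch is an assumption made for illustration: the ErrorDispatcher class, the ErrorCategory enumeration, and the convention that a handler returns true to continue the run are hypothetical, and the real mechanism is a Tcl procedure that reads the errorCode and errorInfo variables.

```cpp
#include <functional>
#include <string>

// Assumed stand-ins for the two global Tcl variables named in the text.
struct ErrorState {
    std::string errorCode;  // short classification of the error
    std::string errorInfo;  // longer description of the error
};

// The two general categories of error described above.
enum class ErrorCategory { OperatingSystem, SimulatorCommand };

// The simulator core reports errors; the installed handler decides what
// happens next (true = continue the run, false = stop).
class ErrorDispatcher {
public:
    typedef std::function<bool(ErrorCategory, const ErrorState&)> Handler;

    // Default policy (an assumption): stop the run on any error.
    ErrorDispatcher()
        : handler_([](ErrorCategory, const ErrorState&) { return false; }) {}

    void setHandler(Handler h) { handler_ = h; }

    // Called by the (hypothetical) simulator core when an error arises.
    bool report(ErrorCategory cat, const std::string& code,
                const std::string& info) {
        state_.errorCode = code;
        state_.errorInfo = info;
        return handler_(cat, state_);
    }

    const ErrorState& state() const { return state_; }

private:
    Handler handler_;
    ErrorState state_;
};
```

A user who wants to tolerate a missing file but stop on a malformed command would install a handler that returns true only for the operating system category.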

The code for PPCSim603 com- 
prises approximately 45,000 lines of 
noncomment, nonblank C++ code. 

The PowerPC 603 
Functional Model 

Neither PPCArch nor PPCSim603 
directly addresses the needs of the 
system hardware designer. However, 
PPCSim603 was carefully designed to 
preserve the interfaces which are 
present in a real microprocessor sys- 
tem. In particular, the PowerPC 603 
CPU module was driven by a clock 
distributed from the simulated bus, 
and the memory system was acces- 
sible only from that bus. The ad- 
herence to these hardware-oriented 
interfaces by the PPCSim603 imple- 
menters made the CPU code from 
PPCSim603 especially adaptable to 
other simulation environments. In- 
deed, one of the requirements of the 
PPCSim603 design was that the CPU 
code could be adapted to an industry- 
standard hardware simulation envi- 
ronment by means of a software 
“wrapper.” The ability to adapt code 
to run in a different simulation envi- 
ronment also leverages the many 
engineer-months of testing that go 
into bringing a simulator to market. 

PPCSim603 was adapted to run 
under the Cadence programming 
language interface (PLI) for the 
Verilog simulation environment [3,
10]. This conversion was
accomplished by replacing a particular
C++ class (the Pin class) with a
variant class that implemented the PLI
interface. The real-memory system 
was also adapted to run under the 
PLI interface. Neither the CPU nor 
the real-memory system can detect 
whether its environment is stand- 
alone (i.e., PPCSim603) or Verilog. 
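
One way to picture this substitution is with an abstract pin class and two interchangeable variants, as in the C++ sketch below. The sketch uses virtual functions for clarity; the actual code may well have selected the variant by conditional compilation instead. The class names StandalonePin and HostPin, the read and drive members, and the HostNets table standing in for the host simulator are all hypothetical, and no real PLI routines are called.

```cpp
#include <map>
#include <string>

// Abstract pin: model code reads and drives signals only through this.
class Pin {
public:
    virtual ~Pin() {}
    virtual int read() const = 0;
    virtual void drive(int level) = 0;
};

// Standalone variant: the pin simply stores its level in memory.
class StandalonePin : public Pin {
public:
    StandalonePin() : level_(0) {}
    int read() const { return level_; }
    void drive(int level) { level_ = level; }
private:
    int level_;
};

// Stand-in for a host hardware simulator's net table. A real variant
// would call the host simulator's interface routines here instead.
typedef std::map<std::string, int> HostNets;

// Host-backed variant: every access is forwarded to the host's nets.
class HostPin : public Pin {
public:
    HostPin(HostNets& nets, const std::string& name)
        : nets_(nets), name_(name) {}
    int read() const { return nets_[name_]; }
    void drive(int level) { nets_[name_] = level; }
private:
    HostNets& nets_;
    std::string name_;
};

// Model code written against Pin cannot tell which variant it was given.
void pulse(Pin& p) { p.drive(1); p.drive(0); }
int sample(const Pin& p) { return p.read(); }
```

Because pulse and sample depend only on the abstract class, the same model code runs unchanged whichever variant is supplied.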

The CLI code from PPCSim603
was also included in the module
which runs under the PLI interface.
This is not strictly necessary, since all
of the semantics of the processor are
accessible from the Pin interface.
However, it proved especially
convenient to do this, since much of the
test code for PPCSim603 was written in
Tcl and this code required the CLI in
order to run.

This adaptation of the basic 
PPCSim603 code from standalone 
simulator to embedded hardware 
functional model took approximately 
three engineer-months of effort.
Much of this effort was spent in
learning the Verilog PLI. This PLI has
been developed for the C language,
but the PPCSim603 CPU module is
written entirely in C++. Although
there is a mechanism in C++ for
dealing with C externals, the reverse
is not true. In particular,
some of the object-oriented semantics
associated with C++ classes (the
constructors for static objects) caused
some initial problems. These problems
were solved by having a C++ main()
routine call back the Verilog PLI
main() routine (which was renamed
vmain() and declared an extern
“C” function). All compilation was
done by the native C++ compiler
and all linkage done by the native
linker. The initial development was
done on an IBM RS/6000 workstation
running AIX, and used the xlC C++
compiler.
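
The shape of this arrangement can be sketched in C++ as follows. The sketch is a minimal illustration under stated assumptions: vmain() here is a stand-in written for the example rather than a real host simulator entry point, and ModelRegistry and cppEntry are hypothetical names. In the arrangement described above, the body of cppEntry would be the C++ main() itself.

```cpp
// In the arrangement described above, vmain() would be the host
// simulator's original C main(), renamed. It is declared with C linkage
// so that C++ code can call it by its unmangled name.
extern "C" int vmain(int argc, char** argv);

// A C++ object with static storage duration. Its constructor has to run
// before any model code uses it; the C++ startup code takes care of this
// when the program's entry point is compiled and linked as C++.
struct ModelRegistry {
    bool ready;
    ModelRegistry() : ready(true) {}
};
static ModelRegistry registry;

// Stand-in definition for this sketch only: it reports success if the
// static object above was already constructed when it was called.
extern "C" int vmain(int, char**) {
    return registry.ready ? 0 : 1;
}

// Body of the C++ entry point. In a real program this would be main():
//   int main(int argc, char** argv) { return cppEntry(argc, argv); }
int cppEntry(int argc, char** argv) {
    return vmain(argc, argv);
}
```

The design choice is simply to let the C++ side own program startup, so that static constructors run before control is handed to the C code.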

Approximately 900 lines of C 
wrapper code were necessary to im- 
plement the hardware environment 
interface. In addition, approximately
1,500 lines of (conditionally
compiled) C++ code were added to the
source code base for PPCSim603.
There are currently no plans to turn 
this PowerPC 603 functional model 
into a product, since there are other 
alternatives which satisfy this market 
segment. The design constraints that 




permit a standalone simulator to be
adapted to a hardware simulator PLI
will prove valuable nonetheless. This
constraint will allow Motorola to
produce PLI-compliant hardware
simulation modules much earlier in the
microprocessor design cycle than was
previously possible.

Summary

PPCArch and PPCSim603 are the
first members of a family of simula-
tors developed by Motorola’s RISC
software group. They are intended to
serve a broad audience of computer
system software and hardware
engineers.

PPCArch is an architectural simu-
lator that supports both the PowerPC
603 and PowerPC 604 microproces-
sor models. It is useful to system and
application software engineers as a
general simulation environment. It
has been used to simulate hundreds
of application programs for a variety
of operating systems, and is being
used by OS developers to port OS
kernels to the PowerPC Architecture.

PPCSim603 is the first of Moto- 
rola’s family of cycle-accurate timing 
simulators for PowerPC microproces- 
sors. It has a number of innovative 
features, including a user-program- 
mable command line interpreter, a 
bus interface for user-defined 
memory-mapped devices, a user- 
extensible OS emulator and flexible 
event and error-handling facilities. 
Using an object-oriented design ap- 
proach, the CPU core of PPCSim- 
603 was successfully adapted to run 
in Cadence’s Verilog environment 
with a minimum of additional 
engineering. 

PPCArch and PPCSim603 are the 
first major projects in M&MTG RISC 
Software to be implemented using 
object-oriented design techniques 
and using the C++ language. The 
software technology developed in 
these projects will see much reuse in 
future Motorola software simulation 
products.


AIX, PowerPC, RS/6000 and xlC are trademarks 
of IBM. 


PPCArch, PPCSim and PPCSim603 are trade- 
marks of Motorola. 


Unix is a trademark of AT&T Bell Laboratories. 


Verilog is a trademark of Cadence Design Sys- 
tems, Inc. 

References 

1. Bedichek, R. Some efficient
architecture simulation techniques. In
Proceedings of Winter USENIX, 1990.

2. Booch, G. Object Oriented Design with
Applications. Benjamin/Cummings,
Calif., 1991.

3. Cadence Design Systems, Inc.
Programming Language Interface Reference
Manual for Verilog. Vol. 1 and 2, 1992.

4. Ellis, M. and Stroustrup, B. The
Annotated C++ Reference Manual.
Addison-Wesley, Reading, Mass., 1990.

5. Lippman, S. C++ Primer. Second ed.
Addison-Wesley, Reading, Mass.,
1991.

6. Ousterhout, J. Tcl: An embeddable
command language. In Proceedings of
Winter USENIX, 1990.

7. PowerPC User Instruction Set
Architecture, Book I, Version 1.04,
May 4, 1993.

8. PowerPC Virtual Environment
Architecture, Book II, Version 1.04,
May 4, 1993.

9. PowerPC Operating Environment
Architecture, Book III, Version 1.04,
May 4, 1993.

10. Voith, R. The PowerPC 603 C++
Verilog interface model. In Proceedings
of Spring Compcon ’94, San Francisco,
1994.

About the author: 

WILLIAM ANDERSON is presently the 
manager of the PowerPC microprocessor 
simulator development team in the RISC 
software group at Motorola in Austin, 
Texas. His interests include simulation 
techniques, computer architecture, com- 
puter arithmetic algorithms and VLSI 
implementation techniques. Author’s 
present address: Motorola RISC Software 
Group, Mail Drop OE112, 6501 William 
Cannon Drive West, Austin TX 78735; 
email: wca@bird.sps.mot.com 

Permission to copy without fee all or part of this 
material is granted provided that the copies are not 
made or distributed for direct commercial advantage, 
the ACM copyright notice and the title of the
publication and its date appear, and notice is given that
copying is by permission of the Association for 
Computing Machinery. To copy otherwise, or to 
republish, requires a fee and/or specific permission. 

© ACM 0002-0782/94/0300 $3.50 








COMMUNICATIONS OF THE ACM June 1994/Vol.37, No.6 69 




